(57) Abstract

The present invention provides a method for generating a
representation of the extent of relatedness between at least two
classes of cells. The invention also provides a method for generating
a representation of the correlation between a first class of cells and a
second class of cells. The correlation reflects a change in the nature
and amount of nucleic acids present in the classes. In these methods,
the cells in each class are chosen from among cells of a given cell
type, cells from a given tissue, and cells from a given organ. The
methods establish similarities or differences between the classes by
defining a plurality of pairs of nucleotide subsequences, each pair
consisting of a first subsequence and a second subsequence, and,
in the nucleic acid of each class of cells, determining the presence
of a fragment with the first subsequence at one end and the second
subsequence at another end and having a length separated by the
first and second subsequences, as well as a quantitation of the extent
to which each fragment is present. The methods then determine the
extent of relatedness reflecting the similarities or differences among
the classes. The invention further provides display means displaying
a representation of the extent of relatedness between the classes of
cells, and displaying a representation of the correlation between
the first class of cells and the second class of cells. Additionally,
the invention provides a representation of the extent of relatedness
between the classes of cells, and a representation of the correlation
between the first class of cells and the second class of cells.
GEOMETRICAL AND HIERARCHICAL CLASSIFICATION BASED ON GENE EXPRESSION

FIELD OF THE INVENTION

This invention relates to representations of the extent of relatedness between cells, cell lines,
tissues, organs, or expressed sequences, based on a genomic analysis of gene expression carried out
with software algorithms.

RELATED APPLICATIONS 

This application claims priority to both United States Application Serial Number ___,
filed September 16, 1999, entitled "GEOMETRICAL AND HIERARCHICAL CLASSIFICATION BASED ON
GENE EXPRESSION", and United States Provisional Application Serial Number 60/101,009, filed
September 17, 1998, entitled "PHYLOGENOMICS AND PHARMACOGENOMICS", which are incorporated
herein by reference in their entirety.

BACKGROUND OF THE INVENTION 

The rapid development of genomics and proteomics in recent years has led to a burgeoning of
applications making use of the new information provided. A significant area in which such information
has been put to use is in the grouping and characterization of pathological states according to the
differential expression of genes in such states. A corollary application is in grouping and characterizing
the therapeutic effects of known or candidate pharmaceutical agents used in treating various pathologies.
Algorithms employing a variety of statistical procedures have been employed to create heuristic displays
of the information obtained from such analyses. These displays include large two dimensional, or even
higher dimensional arrays in which the elements are coded, for example by false color coding, to
represent a particular experimental result. Alternative displays include those in which the experimental
data is used to generate cladistic or radiating tree structures as a representation of relatedness.
Furthermore, it is also possible to use similar methods to group expressed sequences according to
patterns of co-expression over several different biological states.

For example, a system of cluster analysis for genome-wide expression in the yeast
Saccharomyces cerevisiae and in primary human fibroblasts has been presented by Eisen et al. (Proc.
Natl. Acad. Sci. USA 95:14863-14868 (1998)). In the yeast work, DNA microchip arrays carrying
essentially every ORF from this organism were used. Differential expression was studied by varying the
physiological state, including the diauxic shift, the mitotic cell division cycle, sporulation, and
temperature and reducing shocks. The human fibroblasts were stimulated with serum following serum
starvation, and examined using a microarray with 9,800 cDNAs representing approximately 8,600
distinct human transcripts. Additionally, a further independent variable in these experiments is the time
at which an assay point was taken. Data reflecting the differential gene expression in the various studies
were analyzed using pairwise average-linkage cluster analysis (Sokal et al., Univ. Kans. Sci. Bull.
38:1409-1438 (1958)), which was used to compute a dendrogram that assembles all elements into a
single tree.

Colon adenocarcinoma tissue from 40 tumor samples was compared with 22 normal colon tissue
samples using Affymetrix DNA chips to which sequences from human cDNAs were bound (Alon et al.,
Proc. Natl. Acad. Sci. USA 96:6745-6750 (June 1999)). 3,200 full-length human cDNAs and 3,400 ESTs
are represented in sets of 25-bp fragments, as well as such sequences containing a single base mismatch
in the center of the sequence. The gene expression in both the tumor tissue samples and the normal
colon samples was assessed by hybridization. The statistical significance of the correlation between
genes was assessed by calculating pairwise correlation coefficients. The clustering of the expressed
genes was evaluated using an algorithm based on deterministic annealing (Rose et al., Phys. Rev. Lett.
65:945-948 (1990); Rose, Proc. IEEE 86:2210-2239 (1998)) to organize the data in a binary tree. Data
are presented as a large two-dimensional color coded array, with genes displayed along one dimension
and tissue samples along the other; artificial color values are assigned at each array point to indicate the
extent of expression in a third dimension. Clustering analysis reveals patterns in the color distribution
within the array, which are disrupted when various randomization procedures are applied. The clustering
of the genes in the data set reveals groups of genes whose expression is correlated across tissue types.
The algorithm separated the tissues into distinct clusters.

Pharmacological effects of compounds actually used or being screened for use in cancer
chemotherapy were analyzed by cluster analysis at the National Cancer Institute (Weinstein et al.,
Science 275:343-349 (1997)). More than 60,000 compounds were screened against a panel of 60 human
cancer cell lines. A 50% growth-inhibitory concentration of a compound in a given cell line, when
analyzed across all cell lines, provided detailed information on mechanisms of drug action and drug
resistance. Patterns of activity were first analyzed by the COMPARE algorithm (Paull et al., J. Natl.
Cancer Inst. 81:1088 (1989); Jayaram, Biochem. Biophys. Res. Commun. 186:1600 (1992); Paull et al.,
In: CANCER CHEMOTHERAPEUTIC AGENTS, Foye (ed.), American Chemical Society, Washington DC,
1993, pp. 1574-1581; Boyd et al., Drug Dev. Res. 34:91 (1995)). The procedures developed rely on
three databases: an S database characterizing structural information on the candidate compounds, an A
database related to the 60 cell lines, and a T database including information on molecular targets of
action. In an example of the results of the analysis, a three dimensional array displaying compounds
versus targets, with a false color code providing a correlation coefficient in a third dimension for each
position in the array, was developed.

Certain problems arise upon consideration of the procedures currently in use for the correlation
and clustering of genome-derived attributes. Use of DNA microchips inherently limits any analysis to
the sampling of the DNA sequence fragments employed as the capture probes bound to the chips.
Detection of any DNA fragment which does not hybridize with one of the capture probes is not possible,
so that positive results are potentially lost. Additionally, a mutation or other allelic polymorphism may
not bind to the capture probe under conditions of moderate or low stringency, so that again information
relating to a positive result may be lost.

For these reasons there is a need for methods of genomic statistical analysis based on more
comprehensive accessibility to the genomes of the organisms being studied. Furthermore, there remains a
need for ways of presenting the information obtained in genomic analyses of relatedness of genes, and in
genomic analysis of response to actual or candidate pharmaceutical agents, that include information
gleaned from a comprehensive access to the genomes in question. The present invention addresses these
needs, for use is made in the invention of partial and full genomic sequences available from a large
number of sequence databases in clustering analysis of the components appearing as independent
variables in a particular study.

SUMMARY OF THE INVENTION 

The invention provides novel methods of geometric and hierarchical classification between at
least two classes of data sets. Data sets may represent cells, nucleic acid sequences, polypeptide
sequences, or the like. The invention is able to utilize both standard DNA microchip arrays and
non-DNA chip technology to provide input information on nucleic acid moieties of the specified classes
of cells. The data are then treated in various ways to provide representations of relatedness that are
readily interpretable by the human eye. The invention additionally provides novel methods for
generating a representation of the correlation between at least two classes of cells, the correlation
reflecting any changes in the composition and amount of nucleic acids present between the classes.

The cell classes may be from different sources for use in comparing differences between various
cell populations. These differences include, but are not limited to, species differences, tissue differences,
disease state differences, and drug treatment differences. Computer algorithms analyze input data
reflecting differences between chosen cell classes and represent them in a meaningful way.
Prior to the present invention, input information was obtained only using DNA-chip technology
to analyze the nucleic acids of the cell classes to be compared. Drawbacks to these methods are that
identifier sequences need to be already known and isolated, chip technology has size limitations related
to the number of the nucleic acids immobilized on the chips, and, once the chips are manufactured, it is
virtually impossible to expand nucleic acid parameters. The invention provides the use of
GeneCalling™, a non-DNA chip technology, to assay differences between input cell classes. An
unexpected result is that GeneCalling™ is able to provide sensitive comparisons between the disparate
groups above, thereby sidestepping the limitations inherent in the use of DNA chip technology when
assaying input nucleic acid populations.

The invention provides a novel method for generating the extent of relatedness reflecting
similarities or differences in the presence and quantitation of the fragments among the classes by
calculating a distance that reflects the amplitude of a difference vector. In a significant embodiment of
this method for generating the representation of relatedness, the extent of relatedness is provided by
generating a tree structure reflecting the relatedness between any two classes. The branches of the tree
structure reflect the difference vectors and are ramified from nodes.

The invention also provides a novel method for generating a representation of the correlation
between classes of data sets. In a significant embodiment of the method for generating a representation
of the correlation, the correlation is related to a set of orthonormal eigenvectors. In another significant
embodiment of the method for generating a representation of the correlation, the representation is a
cluster diagram or a dendrogram, and includes a tree structure reflecting the relatedness of the pathways
involved in the biochemical or physiological response to a difference between cells of the two classes.

The invention additionally relates to providing geometrical representations of differences
between classes of data sets. The geometrical representations encompass, by way of nonlimiting
example, principal component analysis and principal factor analysis, as well as reduced dimensional
representations derived from them. The geometrical representations are based on differences determined
between classes of cells using any method of analyzing for the presence of genes, nucleic acids, or
fragments thereof, including nucleic acid microchip arrays and differential display of expressed genes or
nucleic acid fragments.

The invention also provides display means for displaying the representation of the extent of
relatedness, the correlation, and the geometrical representations of differences between classes of data
sets, as well as the representations themselves.
BRIEF DESCRIPTION OF THE DRAWING

Figure 1 is a schematic flow diagram illustrating the principal steps involved in generating the
various representations of the invention starting from a set of subsequence-selected fragments found for
the samples.

Figure 2 is a schematic flow diagram illustrating the primary steps involved in carrying out a
principal component analysis.

Figure 3 illustrates hierarchical clustering of four drugs with sterile water as an outgroup. 

Figure 4 is a graphical projection of drug treatments and controls onto principal factors. 

DETAILED DESCRIPTION 

The present invention relates to methods for preparing representations of the relatedness between
cells of any two or more different classes of cells. The classes broadly encompass cells arising in animal
and plant organisms, the cells further being normal cells or cells in a diseased state, including tumor
cells. They further include cells that have been treated with a putative pharmaceutical agent. The
representations are obtained using experimental data that provide size and sequence information on
nucleic acid fragments derived from each of the cellular sources. The fragments may be prepared from
the nucleic acid content of the cells in each class in any of several ways. For example, in a particularly
important embodiment, they may be subjected to digestion by particular pairs of restriction
endonucleases; alternatively, in another important embodiment, cell extracts may be subjected to
amplification using specially designed primer oligonucleotides. The present invention also relates to
methods for preparing representations of the relatedness in terms of co-expression between the nucleic
acid fragments so produced.

The invention further relates to the representations provided by these methods, and to display
means on which such representations are displayed. The methods for preparing the fragments, such as
the use of restriction endonucleases or the application of amplification primers, are chosen to provide
subsequence information relating to the ends of the resulting fragments, while size determination
provides the length of the fragment. In certain applications of these types of information, the size and
subsequence results can optionally be scanned against databases providing known nucleic acid sequences
in order to provide the identity of one or more candidate fragments of known complete nucleic acid
sequences having the correct length and terminal subsequences (U.S. Patent No. 5,871,697; Shimkets et
al., 1999, Nature Biotechnology 17:798-803). This database look-up step is not a required feature of the
current invention. For this reason, the present representations and methods are more comprehensive and
more informative of genomic variations among the samples than those currently known. As described in
the Background of the Invention, currently known procedures are restricted in their comprehensiveness
to those nucleic acid fragments that are applied to DNA microchips as probe sequences in a given
procedure. Except for a narrowly limited set of model organisms with known genome sequence, the
number of such probe sequences is considerably smaller than the number of known nucleic acid sequences
available in sequence databases and employed in the present invention. Furthermore, even for fully
sequenced genomes, genetic variation is not adequately probed with existing DNA microchips. This
distinction characterizes an important advantage of the instant invention.

The invention additionally relates to providing geometrical representations of differences
between classes of cells. The geometrical representations encompass, by way of nonlimiting example,
principal component analysis and principal factor analysis, as well as reduced dimensional
representations derived from them. The geometrical representations are based on differences determined
between classes of cells using any method of analyzing for the presence of nucleic acids, or
fragments thereof, including nucleic acid microchip arrays and differential display of expressed genes or
nucleic acid fragments.

As used herein, "sample" relates to a particular experimental state for which all the variables
being studied in a project are held fixed. By way of nonlimiting example, if a variable is a class of cell,
the "sample" refers to a particular cell type; if a variable is the subsequence pairs employed in the
project, a "sample" refers to a particular subsequence pair; or if a variable is a set of putative
pharmaceutical agents, a "sample" refers to a particular agent from the set. As used herein,
"representation" relates to any graphical, visual, or equivalent non-verbal display that provides an image
of the results obtained according to the methods of the present invention. More specifically, a
"representation" of the invention is obtained by transforming the quantitative results gathered by
experiments underlying the invention. Examples of such data include, by way of non-limiting example,
differential gene expression across classes of cell, and/or across a set of putative therapeutic agents,
and/or equivalent types of experimental parameter.

In important embodiments, a representation of the invention is generated by algorithms executed
in a computer and is suitable for display on a display means, such as a display screen or monitor,
employed in the operation of the computer. The representation is also suitable for storing in a storage
module or data archive of such a computer. It is still further suitable for printing from the computer onto
a medium such as paper or equivalent physical medium, and for recording it onto a portable storage
medium, including, for example, magnetic media, CD-ROMs, and equivalent storage media. As used
herein, "display means" includes any of the objects and media identified above in this paragraph, as well
as equivalent apparatuses and objects suitable for displaying the results of computational processes for
visual inspection.

As used herein, "extent of relatedness" is a characterization according to methods of the present
invention of a degree of similarity or a degree of non-similarity between any two members of the same
type of element; in particularly important embodiments, the type of element may be classes of cells.

As used herein, a "putative pharmaceutical agent" relates to a chemical compound or a
composition comprising at least one chemical compound which is a candidate for being a therapeutic
agent. Any such therapeutic agent may be used in treating a mammal suffering from a disease or a
pathology. In treating the mammal with the therapeutic agent it is intended to attenuate the symptoms
and/or the underlying causes of the disease or the pathology, to ameliorate the symptoms and/or the
underlying causes, and/or to contribute to a cure of the disease or the pathology. Non-limiting examples
of a putative pharmaceutical agent include an agent drawn from a chemical compound library; an isolate
from a natural source; a compound synthesized specifically as a putative agent; or a substance derived or
obtained using the practices of genetic engineering and recombinant nucleic acid technology, such as a
recombinant protein, a fragment of a recombinant protein, a recombinant polypeptide, a fragment of a
recombinant polypeptide, a recombinant peptide, or a nucleic acid including, for example, an
oligonucleotide intended as an antisense agent, and a recombinant gene intended for administration as a
gene therapeutic agent.

As used herein, a "fragment" of a nucleic acid relates to a contiguous portion originating from
the genomic or cDNA-derived nucleic acid from a class of cells. The contiguous portion includes at or
near each end a target subsequence defined according to the operational procedures disclosed herein, and
includes all nucleotides in the sequence of the fragment bounded by the two target subsequences. The
nucleotides between the two target subsequences, together with the subsequences themselves, define a
"length" of the fragment, as used herein. The target subsequences are identified, for example, by
contacting the nucleic acid from the cells with a specific pair of restriction endonucleases, or with a
specific pair of oligonucleotide primers, and in equivalent ways.

The information used in the present invention is obtained from experiments providing the results
of differential gene expression wherein the difference relates to an experimental state and a reference
state. Commonly a reference state refers to a normal, or an unperturbed, or a non-pathological class of
cells. An experimental state may relate to a certain set of conditions applied to one class of cells, and the
corresponding reference state then relates to the same set of conditions applied to a second class of cells.
An experimental state may also relate to a class of cells in the presence of one or more putative
therapeutic agents, in which case the reference state relates to the same class of cells in the absence of
any putative therapeutic agent. An experimental state may furthermore be obtained from a class of cells
that is of interest in a particular set of circumstances. This includes cells of a given cell type, cells from
a given tissue, and cells from a given organ, and further includes cells that may be noncancerous or
cancerous. Types of cell encompassed within the present invention include, by way of non-limiting
example, endothelial cells, mesothelial cells, and epithelial cells. Tissues and organs included within the
present invention may be, by way of non-limiting example, lung, heart, skeletal muscle, smooth muscle,
brain, central nervous system, peripheral nervous system, stomach, liver, kidney, reproductive tissues
and organs, skin, and bone. Cancerous cells include, by way of non-limiting example, cells from prostate
cancer, breast cancer, colon cancer, lung cancer, lymphatic or hematopoietic cancers, and also include
cells obtained from tissue biopsies or from cell lines in the National Cancer Institute human tumor cell
line panel. The cells subjected to analysis in the present invention may also originate from plants, yeast,
fungi, and other taxonomic groupings.

The methods of evaluating the extent of relatedness between classes of cells, for example,
between a first class of cells and a second class of cells, are founded on evaluating the extent of
relatedness of the expression of particular genes between the cells of the two classes. In a preferred
embodiment of the invention, similarities and differences in the susceptibility of the nucleic acid present
in the cells to digestion by specific pairs of restriction endonucleases are determined, according to the
methods of the present invention, by procedures that are disclosed in detail in co-owned U.S. Patent No.
5,871,697 to Rothberg et al., and in Shimkets et al., 1999 (Nature Biotechnology 17:798-803), both of
which are incorporated herein by reference in their entirety.

Briefly, for any experimental state of a class of cells, the nucleic acid content of the cells,
preferably in the form of a preparation of cDNA from the cells, is subjected to restriction endonuclease
("RE") digestion by specific pairs of endonucleases. Each member of the RE pair is chosen to optimize
the likelihood that a restriction fragment resulting from the nuclease digestion will be a unique fragment.
In an important implementation of this method, the restriction nuclease digestion is carried out on cDNA
prepared from the cells of the class in the given experimental state. This implementation leads to
emphasis on genes that are expressed in the experimental state, many of which may be characteristic of
the given experimental state and be more poorly expressed, or not expressed at all significantly, in a
different experimental state. A large number of specific pairs of nucleases may be employed.
Alternatively, expression of a gene may be repressed in a characteristic way in a given experimental state
and be expressed at a higher level, such as at a constitutive level, in a different experimental state. By
way of non-limiting example, several pairs of restriction nucleases that may be employed in
implementing the present invention are disclosed in U.S. Patent No. 5,871,697.

In an alternative embodiment, the extent of relatedness may be obtained by amplification
fragment length polymorphism analysis ("AFLP"). Briefly, the nucleic acid content of
the class of cells being examined is subjected to a primer-dependent amplification procedure in which
any of a set of primer pairs is used to initiate amplification. Amplification procedures are described in
considerable detail in, for example, Innis et al., PCR PROTOCOLS, A GUIDE TO METHODS AND
APPLICATIONS, Academic Press, New York (1989), and Innis et al., PCR STRATEGIES, Academic Press,
New York (1995). The primers of each primer pair are different from each other, and reflect different
subsequences that are the object of the amplification process. Amplification may proceed by any
procedure, including polymerase chain reaction, known in the field of molecular biology. In AFLP, the
length of an amplicon found in a given experimental state differs from the length found in a different
experimental state. This may arise, for example, if the given experimental state arises from a mutation
that occurs in a subsequence recognized by a primer used in the amplification reaction. It may also arise
from a deletion from, or an insertion into, the nucleic acid of the cells in that state.

The experimental and computational procedures that may be employed to generate the
representations of the present invention are described generally below.

Measurements 

At the outset, the gene expression levels are determined experimentally. This can be done, in a
preferred embodiment, by following the general protocols of differential expression using restriction
endonucleases (U.S. Patent No. 5,871,697). For each pair of restriction enzymes and each biological
sample, a pool of fluorescently-labeled DNA fragments is generated. Electrophoresis is then performed
to separate these fragments based on size, and an intensity, designated as $I_{srt}(x)$, is detected, where s
labels the sample, i.e., the cell class; r labels the restriction enzyme pair, i.e., the gene fragment; t labels
the trial; and x is the length of the fragment as determined by electrophoresis. The length x may be
either a continuous index or a convenient discretization. As an example, the resolution of the
electropherogram may be set to a discretization of 0.1 nucleotide ("nt"). Commonly three independent
trials are performed. A mean signal $I_{sr}(x)$ is then obtained by averaging over the $n_t$ trials:

$$I_{sr}(x) = \frac{1}{n_t} \sum_{t} I_{srt}(x).$$
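The trial averaging can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `mean_signal` and the list-of-lists layout (one list of intensities per trial, sharing one discretized length grid) are hypothetical and are not part of the disclosed method.

```python
from statistics import fmean


def mean_signal(trials):
    """Average per-trial intensity traces I_srt(x) into a mean trace I_sr(x).

    `trials` is a list of equal-length lists, one per trial t; each index
    corresponds to one discretized fragment length x (hypothetical layout).
    """
    if not trials:
        raise ValueError("need at least one trial")
    n_bins = len(trials[0])
    if any(len(t) != n_bins for t in trials):
        raise ValueError("all trials must share the same length grid")
    # Mean over the n_t trials at every length position.
    return [fmean(t[i] for t in trials) for i in range(n_bins)]


# Three independent trials for one sample and one restriction enzyme pair.
trace = mean_signal([[1.0, 4.0, 2.0], [2.0, 5.0, 2.0], [3.0, 6.0, 2.0]])
```

Here `trace` holds one averaged intensity per length position for a single sample s and enzyme pair r.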
Next, lengths x for each restriction enzyme pair r where some of the samples have a significant
difference in measured intensity are identified. Such a difference is determined with respect to cell
types, or with respect to the presence vs. the absence of a putative pharmaceutical agent. Labeling the d-th
such difference d, the values $I_{sd} = I_{sr}(x_d)$ are then collected. Any of several methods for identifying
significant differences may be employed, some of which are outlined herein. For example, an important
method involves the following computational steps:

1. The mean $I_r(x) = (1/n_s) \sum_s I_{sr}(x)$ is evaluated.

2. All positions, i.e. lengths, where, for at least one sample, $I_{sr}(x) - I_r(x)$ is larger than some
threshold value, are marked.

3. The largest value of $I_{sr}(x) - I_r(x)$, determined as a difference between a sample state and
the mean for restriction enzyme pair r, is found and the length x, indexing the difference,
is marked.

4. Step 3 is repeated for successively smaller values of the intensity difference. If the
length x that marks the current largest difference is within a distance w from the length
of a previously identified difference, the current difference is skipped and the next
smaller difference is considered.

5. Step 4 is repeated until there are no more differences to consider.
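The five steps above can be sketched as a greedy selection. The sketch below is a minimal illustration, not the disclosed implementation: the function name `select_differences`, the dictionary layout, and the use of the absolute deviation from the mean (so that both increases and decreases are picked up) are assumptions, and w is measured in grid positions rather than nucleotides.

```python
def select_differences(traces, threshold, w):
    """Greedy selection of lengths with large intensity differences.

    traces: dict mapping a sample label s to its mean intensities I_sr(x)
            on a shared length grid, for one restriction enzyme pair r.
    threshold: minimum deviation from the across-sample mean (step 2).
    w: exclusion distance, in grid positions, around selected lengths.
    Returns the selected grid positions, largest deviation first.
    """
    samples = list(traces.values())
    n_bins = len(samples[0])
    # Step 1: mean over samples at each length.
    mean = [sum(s[x] for s in samples) / len(samples) for x in range(n_bins)]
    # Step 2: largest deviation of any sample from the mean at each length.
    # Assumption: the absolute value is used here.
    dev = [max(abs(s[x] - mean[x]) for s in samples) for x in range(n_bins)]
    candidates = [x for x in range(n_bins) if dev[x] > threshold]
    # Steps 3-5: visit candidates in order of decreasing deviation and skip
    # any candidate lying within w of a previously selected length.
    candidates.sort(key=lambda x: dev[x], reverse=True)
    selected = []
    for x in candidates:
        if all(abs(x - y) > w for y in selected):
            selected.append(x)
    return selected


traces = {
    "control": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "treated": [0.0, 6.0, 10.0, 0.0, 0.0, 0.0, 4.0, 0.0],
}
picked = select_differences(traces, threshold=1.0, w=2)
```

In this toy input, position 1 sits on the shoulder of the larger peak at position 2 and is skipped, so one length per peak survives.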

Another method involves finding differences that meet a statistical criterion. A particular
example of such a method involves the computational steps of:

1. defining a set of sample classes and assigning each sample to a particular class c;

2. for each restriction enzyme pair r and length x, evaluating the F-statistic for the set of
measurements and the classes c to which samples are assigned, thereby providing
the probability $p_r(x)$ that any differences between sample classes may be explained by
random variation (see, for example, P. Hinton, Statistics Explained, Routledge 1995);

3. ordering the probabilities $p_r(x)$ from smallest (most significant) to largest (least
significant);

4. optionally truncating the list at some threshold value of $p_r(x)$ above which differences are
no longer considered significant (accepted values are $p_r(x)$ = 0.01 to 0.05);

5. finding the smallest value of $p_r(x)$ and marking the length x as a difference for restriction
enzyme pair r;

6. repeating step 5 and determining whether the length x that marks the current difference
is in a region that is within a distance w of a previous difference, in which case the
current difference is skipped and the next most significant difference is considered; and

7. continuing until there are no more differences to consider.
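A rough sketch of this statistical selection follows, using the one-way ANOVA F-statistic. It is an illustration under several assumptions: the names `f_statistic` and `select_by_f` and the dictionary layout are hypothetical; the optional truncation of step 4 is omitted; and lengths are ranked by decreasing F rather than by increasing probability. Because every length shares the same class sizes, and therefore the same degrees of freedom, the two orderings coincide; converting F to an actual probability would additionally require the F-distribution, which is not implemented here.

```python
def f_statistic(groups):
    """One-way ANOVA F-statistic for a list of groups of measurements."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two classes and more samples than classes")
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_by_f(intensity, classes, w):
    """Rank lengths by F (steps 2-3) and pick them greedily (steps 5-7).

    intensity: dict sample -> list of intensities on a shared length grid.
    classes:   dict sample -> class label c (step 1).
    w:         exclusion distance, in grid positions.
    """
    labels = sorted(set(classes.values()))
    n_bins = len(next(iter(intensity.values())))
    scored = []
    for x in range(n_bins):
        groups = [[intensity[s][x] for s in intensity if classes[s] == c]
                  for c in labels]
        scored.append((f_statistic(groups), x))
    # Most significant first: larger F corresponds to smaller probability.
    scored.sort(key=lambda fx: fx[0], reverse=True)
    selected = []
    for _, x in scored:
        if all(abs(x - y) > w for y in selected):
            selected.append(x)
    return selected


intensity = {
    "s1": [1.0, 5.0, 2.0],
    "s2": [1.2, 5.1, 4.0],
    "s3": [1.1, 9.0, 3.0],
    "s4": [0.9, 9.2, 3.0],
}
classes = {"s1": "control", "s2": "control", "s3": "treated", "s4": "treated"}
ranked = select_by_f(intensity, classes, w=0)
```

With w=0 nothing is excluded, so `ranked` is simply the lengths in order of decreasing F; a larger w suppresses neighbours of an already selected length.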

These exemplary computational procedures provide a set of measures of intensity $I_{sd}$ for the class
of cells in sample s at difference d.

Distances 

For hierarchical clustering, a distance D,. may be detmed as the distance in vector space 
between pairs of samples s and s". A variets' of methods for calculating D,, are available. Some 
examples, which are intended as being noniimiting. are pro\ ided below, 

$D_{ss'}$ as a scaled correlation function: 

1. One calculates $\mu_j = (1/n_j) \sum_s I_{sj}$ and $\sigma_j = [(1/n_j) \sum_s I_{sj}^2 - \mu_j^2]^{1/2}$, where j 
labels the differences. If data is missing, for example no measurement of $I_{sj}$ exists for 
some sample s, that sample is excluded from the sum and $n_j$ is reduced by 1. 

2. One calculates $J_{sj} = (I_{sj} - \mu_j)/\sigma_j$. If data is missing for $I_{sj}$, then $J_{sj}$ is defined as $J_{sj} = 0$. 

3. One calculates $\mu_s = (1/n_s) \sum_j J_{sj}$ and $\sigma_s = [(1/n_s) \sum_j J_{sj}^2 - \mu_s^2]^{1/2}$. 

4. One calculates $K_{sj} = (J_{sj} - \mu_s)/\sigma_s$. 

5. One calculates the covariance matrix $S_{ss'} = (1/n) \sum_j K_{sj} K_{s'j}$, where n is the number of 
differences in the sum. 

6. One calculates the correlation matrix $C_{ss'} = S_{ss'} / [S_{ss} S_{s's'}]^{1/2}$. 

7. One calculates $D_{ss'} = [2 - 2 C_{ss'}]^{1/2}$. 
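
By way of a nonlimiting sketch, the seven steps of the scaled correlation function may be expressed in Python as follows. The function name `scaled_correlation_distances` and the convention of marking a missing measurement with `None` are assumptions made for this illustration. Each row of the input is one sample and each column is one difference. For simplicity the sketch normalizes the per-sample sums by the total number of differences.

```python
import math


def scaled_correlation_distances(I):
    """I[s][j] is the intensity of sample s at difference j, or None if missing.
    Returns the matrix of distances D[s][t] = sqrt(2 - 2 * C[s][t])."""
    n_samples, n_diffs = len(I), len(I[0])

    # Steps 1-2: standardize each difference (column) over samples with data;
    # missing entries are left at J = 0.
    J = [[0.0] * n_diffs for _ in range(n_samples)]
    for j in range(n_diffs):
        col = [I[s][j] for s in range(n_samples) if I[s][j] is not None]
        if not col:
            continue
        mu = sum(col) / len(col)
        var = sum(v * v for v in col) / len(col) - mu * mu
        sigma = math.sqrt(max(0.0, var))
        for s in range(n_samples):
            if I[s][j] is not None and sigma > 0:
                J[s][j] = (I[s][j] - mu) / sigma

    # Steps 3-4: standardize each sample (row) over the differences.
    K = []
    for row in J:
        mu = sum(row) / n_diffs
        var = sum(v * v for v in row) / n_diffs - mu * mu
        sigma = math.sqrt(max(0.0, var))
        K.append([(v - mu) / sigma if sigma > 0 else 0.0 for v in row])

    # Step 5: covariance between samples.
    S = [[sum(a * b for a, b in zip(K[s], K[t])) / n_diffs
          for t in range(n_samples)] for s in range(n_samples)]

    # Steps 6-7: correlation, then distance.
    D = [[0.0] * n_samples for _ in range(n_samples)]
    for s in range(n_samples):
        for t in range(n_samples):
            denom = math.sqrt(S[s][s] * S[t][t])
            c = S[s][t] / denom if denom > 0 else 0.0
            D[s][t] = math.sqrt(max(0.0, 2.0 - 2.0 * c))
    return D
```

Under this definition two samples with identical profiles are at distance 0, uncorrelated samples are at distance $2^{1/2}$, and perfectly anticorrelated samples are at distance 2.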

$D_{ss'}$ as a Euclidean distance: $D_{ss'} = [\sum_j (I_{sj} - I_{s'j})^2]^{1/2}$. 

$D_{ss'}$ as a Pearson distance: $D_{ss'} = [\sum_j (I_{sj} - I_{s'j})^2 / \sigma_j^2]^{1/2}$, where $\sigma_j$ is defined in step 1 of 
the scaled correlation function above. 

$D_{ss'}$ as a pairwise Pearson distance: 

1. One calculates the covariance matrix $S_{ss'} = (1/n_{ss'}) [\sum_j I_{sj} I_{s'j} - (\sum_j I_{sj})(\sum_j I_{s'j}) / n_{ss'}]$, 
where $n_{ss'}$ is the number of differences j measured in both samples. 

2. One calculates the correlation matrix $C_{ss'} = S_{ss'} / [S_{ss} S_{s's'}]^{1/2}$. 

3. One calculates $D_{ss'} = [2 - 2 C_{ss'}]^{1/2}$. 
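
The Euclidean, Pearson and pairwise Pearson distances may be illustrated, without limitation, by the following Python sketch. The function names and the use of `None` for a missing measurement are assumptions of the sketch. In `pairwise_pearson` the variances are computed over the same set of jointly measured differences as the covariance, which is one reasonable reading of the pairwise definition.

```python
import math


def euclidean(a, b):
    """[sum_j (a_j - b_j)^2]^(1/2) for two complete intensity vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a, b, sigma):
    """Euclidean distance with difference j scaled by its standard deviation sigma[j]."""
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, sigma)))


def pairwise_pearson(a, b):
    """[2 - 2C]^(1/2), using only the differences measured in both samples."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]

    def cov(u, v):
        # (1/n) [sum u*v - (sum u)(sum v)/n]
        return (sum(p * q for p, q in zip(u, v)) - sum(u) * sum(v) / n) / n

    c = cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))
    return math.sqrt(max(0.0, 2.0 - 2.0 * c))
```

The sketch assumes that each sample varies over the jointly measured differences; a constant profile would give a zero variance and the correlation would be undefined.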



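$D_{ss'}$ as a Mahalanobis distance: 

1. One calculates the covariance matrix $S_{jj'} = \sum_s I_{sj} I_{sj'} - (\sum_s I_{sj})(\sum_s I_{sj'}) / n$, where n is the 
number of samples in the sum. 

2. One calculates the correlation matrix $C_{jj'} = S_{jj'} / [S_{jj} S_{j'j'}]^{1/2}$ and its matrix inverse $C^{-1}_{jj'}$. 

3. One calculates $D_{ss'} = [\sum_{jj'} (I_{sj} - I_{s'j}) C^{-1}_{jj'} (I_{sj'} - I_{s'j'})]^{1/2}$. 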

It is contemplated that other distance methods known in the art may be used in the invention, 
such as Spearman correlation, and the like. Other methods known in the art can be found, for example 
and not by means of limitation, in K. V. Mardia, J. T. Kent, and J. M. Bibby, MULTIVARIATE ANALYSIS, 
Academic Press, New York, 1979. 

Hierarchical Clustering 

The distances can be used to perform hierarchical clustering of the samples. A general algorithm 
for clustering is described below. 

1. Each sample s is assigned its own initial cluster c. 

2. One calculates all the distances between pairs of clusters and finds the smallest distance. 
These two clusters are joined into a single cluster and the number of clusters is decreased 
by 1. 

3. Step 2 is repeated until only a single cluster remains. 

In order to implement this algorithm, a method to calculate the distance between pairs of clusters 
is also required. Some nonlimiting examples of such calculations, using well-known methods, are 
indicated below. 

Nearest neighbor, single linkage: The distance between clusters c and c' is the smallest distance 
$D_{ss'}$, where s ranges over all samples in cluster c and s' ranges over all samples in cluster c'. 

Unweighted pair group method using arithmetic averages (UPGMA), also known as average 
linkage: The distance between clusters c and c' is $(\sum_{ss'} D_{ss'}) / (n_c n_{c'})$, where s ranges over all samples in 
cluster c, s' ranges over all samples in cluster c', $n_c$ is the number of samples in cluster c, and $n_{c'}$ is the 
number of samples in cluster c'. 

Furthest neighbor, complete linkage: The distance between clusters c and c' is the largest 
distance $D_{ss'}$, where s ranges over all samples in cluster c and s' ranges over all samples in cluster c'. 
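
As a nonlimiting illustration, the general clustering algorithm and the three linkage rules above may be sketched in Python. The function names `cluster_distance` and `hierarchical_clustering`, and the representation of the resulting tree as a list of merges, are assumptions made for this sketch. The input is any symmetric matrix of sample distances $D_{ss'}$.

```python
def cluster_distance(D, c1, c2, linkage):
    """Distance between clusters c1 and c2 (lists of sample indices)."""
    values = [D[s][t] for s in c1 for t in c2]
    if linkage == "single":      # nearest neighbor
        return min(values)
    if linkage == "complete":    # furthest neighbor
        return max(values)
    if linkage == "average":     # UPGMA
        return sum(values) / (len(c1) * len(c2))
    raise ValueError("unknown linkage: " + linkage)


def hierarchical_clustering(D, linkage="average"):
    """Agglomerative clustering. Returns the list of merges in the order
    performed, each as (members_of_first, members_of_second, distance)."""
    clusters = [[s] for s in range(len(D))]   # step 1: one cluster per sample
    merges = []
    while len(clusters) > 1:                  # step 3: repeat until one remains
        best = None
        for a in range(len(clusters)):        # step 2: find the closest pair
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(D, clusters[a], clusters[b], linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return merges
```

The list of merges is sufficient to draw a dendrogram: each merge is a node, and the recorded distance gives the height at which the two branches join. The sketch recomputes all cluster distances at every step, which is adequate for small numbers of samples.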

Other distance-based hierarchical clustering methods are well-known. See, for example, 
Wen-Hsiung Li, MOLECULAR EVOLUTION, Sinauer Assoc., 1997. 

Software packages are available to perform the clustering and display the results. See, for 
example, Phylip, Joe Felsenstein, http://evolution.genetics.washington.edu for clustering, and Treeview, 
Rod Page, http://taxonomy.zoology.gla.ac.uk/rod/treeview.html for display. The source code for the unit 
within Phylip employed for the clustering, and the downloaded executable file of Treeview for Windows 
95 and Windows NT, as well as a manual for Treeview, are available from the owner of the present 
application. 

Two-Dimensional Clustering 

It is also possible to cluster the differences, rather than clustering the samples. One simply 
exchanges the roles of the samples and differences in the equations above. Furthermore, it is possible to 
perform clustering of both samples and differences, and then to display the measurements $I_{sd}$ in which 
both samples and differences are presented in cluster order. 
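
A minimal Python sketch of these two operations follows. The function names `transpose` and `reorder` are assumptions of the sketch. Exchanging the roles of samples and differences amounts to transposing the measurement matrix before the distance and clustering calculations, and displaying in cluster order amounts to permuting rows and columns by the leaf orders of the two trees.

```python
def transpose(I):
    """Exchange the roles of samples (rows) and differences (columns)."""
    return [list(col) for col in zip(*I)]


def reorder(I, row_order, col_order):
    """Present the measurements with rows and columns in cluster order.

    row_order and col_order are the leaf orders obtained by clustering the
    samples and the differences, respectively (hypothetical inputs)."""
    return [[I[s][j] for j in col_order] for s in row_order]
```

For example, `reorder(I, [1, 0], [2, 0, 1])` places the second sample first and the third difference first.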

Principal Component Analysis and Principal Factor Analysis 

Principal component analysis is described in standard texts. See, for example, Mardia, Kent, and 
Bibby. To perform principal component analysis, one begins with a correlation matrix $C_{ss'}$, as defined 
above in the section "Distances". (Alternatively, one could use the covariance matrix $S_{ss'}$.) Eigenvalues 
and eigenvectors, defined such that $\sum_{s'} C_{ss'} g_{s'i} = \lambda_i g_{si}$, where the i-th eigenvalue is $\lambda_i$ and its eigenvector is $g_{si}$, 
are calculated. The eigenvalues are ordered from largest to smallest: $\lambda_1 > \lambda_2 > ... > \lambda_n$. To obtain a 
reduced dimensional depiction of the samples, a number of desired dimensions k is chosen. Then, in 
k-dimensional space, sample s is represented as the point $(g_{s1}, g_{s2}, ..., g_{sk})$. Samples that are close in the 
k-dimensional space have similar expression profiles and may be considered to be related. 
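
The eigenvalue calculation and the k-dimensional representation may be illustrated, without limitation, by the following Python sketch. It uses power iteration with deflation, a technique chosen here only because it needs no external library; it is not stated in the original disclosure, and in practice a numerical linear algebra library would normally be used. The names `top_eigenpairs` and `pca_coordinates` are assumptions of the sketch, which further assumes a symmetric positive semidefinite input matrix with distinct leading eigenvalues.

```python
import math


def top_eigenpairs(C, k, iterations=500):
    """Return [(eigenvalue, unit eigenvector), ...] for the k largest
    eigenvalues of a symmetric positive semidefinite matrix C."""
    n = len(C)
    A = [row[:] for row in C]
    pairs = []
    for i in range(k):
        # Arbitrary nonzero starting vector, normalized to unit length.
        v = [1.0 + ((i + j) % n) / n for j in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = [sum(A[r][c] * v[c] for c in range(n)) for r in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        lam = sum(v[r] * sum(A[r][c] * v[c] for c in range(n)) for r in range(n))
        pairs.append((lam, v))
        # Deflation: remove the component just found before the next pass.
        for r in range(n):
            for c in range(n):
                A[r][c] -= lam * v[r] * v[c]
    return pairs


def pca_coordinates(C, k):
    """Sample s is represented as the point (g_s1, ..., g_sk)."""
    pairs = top_eigenpairs(C, k)
    return [[vec[s] for _, vec in pairs] for s in range(len(C))]
```

The sign of each eigenvector is arbitrary, so two runs may differ by a reflection of an axis; distances between samples in the reduced space are unaffected.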

As an alternative to using the correlation matrix $C_{ss'}$ as the starting point for principal component 
analysis, it is possible to calculate principal components using the inner product matrix from 
multidimensional scaling defined as 

B = HCH (2) 

where C is the correlation matrix, H is the centering matrix with diagonal elements given by 1 - (1/n) and 
off-diagonal elements -(1/n), where n is the number of items being correlated. (See, for example, 
Mardia, Kent, and Bibby, Multivariate Analysis, and Arkin, Shen, and Ross, Science 277: 1275 (1997)). 
The k-th principal component is then the k-th eigenvector of B normalized to unit length and ordered by 
decreasing eigenvalue $\lambda_k$, and the k-th principal factor is obtained by scaling the eigenvector by $\lambda_k^{1/2}$. The 
projection of sample s onto the k-th principal factor is the element of the factor for row s. The 
components or factors are ordered from 1 (corresponding to the most informative) to n (corresponding to 
the least informative). By using some, but not all, of the components or factors, the samples can be 
represented in a small-dimensional geometric space. Furthermore, the amount of information retained in 
the representation can be related to the eigenvalues of the components that are used (See Mardia, Kent, 
and Bibby). 

A centered inner product matrix B appropriate for principal component or principal factor 
analysis can also be obtained from any distance matrix $D_{ss'}$ as 

B = HAH (3) 

where 

$A_{ss'} = -(1/2) D_{ss'}^2$ (4) 
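
Equations (3) and (4) may be illustrated by the following nonlimiting Python sketch. The function name `centered_inner_product` is an assumption of the sketch. Because H subtracts row and column means, the product HAH can be formed without building H explicitly: each element of A has its row mean and column mean subtracted and the grand mean added back.

```python
def centered_inner_product(D):
    """B = H A H with A[s][t] = -0.5 * D[s][t]**2 and H the centering matrix."""
    n = len(D)
    A = [[-0.5 * D[s][t] ** 2 for t in range(n)] for s in range(n)]
    row_mean = [sum(A[s]) / n for s in range(n)]
    col_mean = [sum(A[s][t] for s in range(n)) / n for t in range(n)]
    grand_mean = sum(row_mean) / n
    return [[A[s][t] - row_mean[s] - col_mean[t] + grand_mean
             for t in range(n)] for s in range(n)]
```

As a check, for three samples placed on a line at positions 0, 2 and 4 with ordinary Euclidean distances, the result is the matrix of inner products of the centered positions -2, 0 and 2.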

To perform principal factor analysis, factor i is defined as $f_{si} = \lambda_i^{1/2} g_{si}$ where, as before, $\lambda_i$ is the 
eigenvalue of the i-th eigenvector $g_{si}$. An orthonormal rotation matrix G ($\sum_j G_{ij} G_{kj}$ is 1 if i = k and 0 
otherwise, det(G) = 1) is introduced and the factors are rotated to obtain rotated coordinates for the 
samples. Thus, to obtain a k-dimensional representation of the locations of the samples, the following 
operations are performed: 

1. One calculates the correlation matrix $C_{ss'}$ or the covariance matrix $S_{ss'}$, where s and s' 
label individual samples. 

2. One calculates the eigenvalues $\lambda_i$ and eigenvectors $g_{si}$ for the matrix, with $\lambda_1 > \lambda_2 > ... > \lambda_n$. 

3. Unrotated factor loadings $h_{sj} = \lambda_j^{1/2} g_{sj}$ are defined. 

4. The first k factor loadings and an orthonormal rotation matrix G are selected. The j-th 
coordinate of sample s in the rotated space is $\sum_i h_{si} G_{ij}$. 
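
Steps 3 and 4 may be illustrated by the following nonlimiting Python sketch for the case k = 2. The function names `factor_loadings`, `rotate` and `rotation_2d` are assumptions of the sketch, and the rotation angle is supplied by the caller; the sketch does not optimize the rotation.

```python
import math


def factor_loadings(eigenvalues, eigenvectors):
    """h[s][j] = sqrt(lambda_j) * g[s][j].

    eigenvectors[j][s] is component s of the j-th unit eigenvector."""
    n = len(eigenvectors[0])
    return [[math.sqrt(lam) * vec[s] for lam, vec in zip(eigenvalues, eigenvectors)]
            for s in range(n)]


def rotate(loadings, G):
    """Rotated coordinate j of sample s is sum_i h[s][i] * G[i][j]."""
    k = len(G)
    return [[sum(row[i] * G[i][j] for i in range(k)) for j in range(k)]
            for row in loadings]


def rotation_2d(theta):
    """An orthonormal 2 x 2 rotation matrix with determinant 1."""
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]
```

Because G is orthonormal, the rotation changes the axes along which the samples are described but leaves the distances between samples unchanged.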

The rotation matrix G may be optimized according to standard criteria. See, for example, 
Mardia, Kent, and Bibby, Ch. 9.6 on Varimax rotation, supra. The rotated axes represent factors that 
influence the observed measurements for the samples. 

In implementing the methods of the present invention, these operations may be sequentially 
combined in any of several ways according to the intended display, i.e., the nature of the relatedness that 
is intended to be shown. 

Also, the information from the principal factors can be used to help filter the experimental noise 
from the correlation functions. For example, it is possible to select a cut-off principal factor j < n, then 
compute distances and correlations between samples based on their representation in the j-dimensional 
principal factor space. 

As a nonlimiting example of the computational procedures that may be employed in the present 
invention, a schematic overview of procedures that may be adopted is presented in Figure 1. The 
experimental results represent the sample-dependent and selection-dependent intensities obtained in an 
experiment, arrayed in a measurement matrix. In the implementation shown in Figure 1, the difference 
bands having various, defined, nucleotide lengths are arrayed as the columns of the matrix; they are 
obtained in various experiments that are selected using different members of the sets of subsequence 
pairs. The samples represent the classes of cells, or cells treated with a set of putative pharmaceutical 
agents, or analogous sample sets, and are arrayed as the rows. 

The values arrayed in the measurement matrix may then be subjected to correlation analysis to 
provide either direct sample correlations or correlations of differences. The measurement matrix can 
also be subjected to a calculation providing a vectorial distance between samples; such a sample distance 
may also be obtained from the sample correlation result. The distance vector can further be subjected to 
a linkage analysis to provide hierarchical clustering of the samples. Additionally, the correlated samples 
may be subjected to principal component analysis providing the principal factors contributing to a state 
or to a difference. 

A nonlimiting example of the way in which a principal component analysis may be carried out, 
using methods described herein, is presented in Figure 2. The correlation matrix or the centered inner 
product matrix described above is subjected to appropriate operations to provide the principal 
components and the principal factors, based on their eigenvalues and eigenvectors. Advantageously, a 
reduction in the number of dimensions employed, that is, in the number of eigenstates, may provide a 
filtering effect, reducing the noise in the vector distances calculated. 

The representations provided in the present invention find use in various applications of 
genomics in the biological and medical fields. Extents of relatedness and correlations provide rapid 
overviews of enzymatic reactions, metabolic pathways and physiological effects that become 
distinguished when comparing states. When a pathological state is compared with a normal state, for 
example in a mammal, and especially in a human, the display of distinguished pathways is instructive in 
the development of therapeutic approaches and/or therapeutic agents for the treatment of the pathological 
state. When a putative pharmaceutical agent is compared to a state that omits the agent, or when one 
such agent is compared with another, important information is provided relating to the metabolic 
reactions induced by or undergone by the agent or agents, leading to optimal choice of such agents. This 
information may also provide leads to the development of novel pharmaceutical agents. If the genome 
being studied is a plant genome, such as the genome of an important crop plant, analogous principles 
apply. 

Nucleic acid assays 

The present invention provides a method for generating a representation of the extent of 
relatedness between at least two classes of cells. In this method, the cells in each class are chosen from 
among cells of a given cell type, cells from a given tissue, and cells from a given organ. Generation of 
nucleic acids from the cell samples of choice may be as described in the GeneCalling™ methodology. 
See U.S. Patent No. 5,871,697. The method includes the steps of: (a) defining a plurality of pairs of 
nucleotide subsequences, each pair consisting of a first subsequence and a second subsequence; 
(b) isolating the nucleic acid of each class of cells and assaying for the presence of a nucleic acid 
fragment with the first subsequence at one end and the second subsequence at another end and having a 
length separated by the first and second subsequences, and quantitating the extent to which each 
fragment is present; and (c) determining the extent of relatedness reflecting similarities or differences in 
the presence and quantitation of the fragments among the classes using software algorithm programs 
known in the art. 

One important embodiment of this method, i.e., determining the presence of the fragments and 
quantitating the amounts present, as described in step (b) above, is carried out by a process that includes 
the steps as follow. First, samples of the nucleic acid from the cells of each class are digested with a 
plurality of specific pairs of restriction endonucleases ("REs"). Each sample is treated by one RE pair, 
where one RE of the pair targets the first subsequence described in step (a) above, and the second RE of 
the pair targets the second subsequence, with each digestion providing specific restriction fragments. 

Second, double stranded adapter DNA molecules are hybridized to the fragments. Each adapter 
DNA molecule comprises: (i) a shorter strand, preferably having no 5' terminal phosphate, consisting of 
a first and second portion, the first portion being a region at the 5' end that is complementary to the 
overhang produced by one of the REs of the given pair and a second portion hybridizable to the opposite 
longer strand of the adapter, and (ii) a longer strand, preferably having no 5' terminal phosphate, 
consisting of a first portion at its 3' end complementary to the above-mentioned second portion of the 
shorter strand, and an optional second portion at its 5' end comprising a unique region not hybridizable 
to any sequence present in the original sample population. See U.S. Patent No. 5,871,697. The longer 
strand is optionally labeled with fluorochrome 208, although any DNA labeling system that preferably 
allows multiple labels to be simultaneously distinguished is usable in this invention. See, e.g., Ausubel, 
et al., CURRENT PROTOCOLS IN MOLECULAR BIOLOGY, John Wiley & Sons, New York, NY, 1993. 

Third, output signals from each ligated fragment are detected for each sample population so 
treated. Each ligated fragment generates output signals that characterize (a) the presence of the given 
subsequences corresponding to the RE pair used in a particular run, (b) the length between the two 
subsequences corresponding to the two REs employed in a given run, and (c) the quantitation of the 
relative amounts present of each fragment so generated in a given run. 

Optionally, a nucleotide sequence database may be searched for sequences that are predicted to 
produce, or alternatively, not produce, the one or more output signals generated by the nucleic acid from 
the cells of each class, given the parameters described above. The analysis methods comprise, first, 
selecting a database of DNA sequences representative of the DNA sample to be analyzed, second, using 
this database and a description of the experiment to derive the pattern of simulated signals, contained in 
a database of simulated signals, that would be produced by DNA fragments generated in the 
experiment, and third, for any particular detected signal, using the pattern or database of 
simulated signals to predict the sequences in the original sample likely to cause this signal. Further 
analysis methods present an easy to use user interface and permit determination of the sequences actually 
causing a signal in cases where the signal may arise from multiple sequences, and perform statistical 
correlations to quickly determine signals of interest in multiple samples. A sequence from a searched 
database is predicted to produce the one or more output signals when that sequence has both (a) the same 
length between occurrences of target nucleotide subsequences as is represented by the one or more 
output signals, and (b) the same target nucleotide subsequences that are represented by said one or more 
output signals, or target nucleotide subsequences that are members of the same sets of target nucleotide 
subsequences represented by the one or more output signals. 
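
The database search described above may be illustrated, without limitation, by the following Python sketch. All function names, the dictionary layout of the database and of the table of simulated signals, and the `tolerance` parameter are assumptions made for this illustration. The sketch also simplifies the length calculation: it measures the distance between the start positions of the two recognition sites, whereas the length observed in an experiment also depends on where each enzyme cuts and on the adapters attached.

```python
def find_sites(seq, site):
    """Start positions of every occurrence of a recognition site in seq."""
    positions, start = [], seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


def predicted_lengths(seq, first, second):
    """Lengths of fragments bounded by one `first` site and one `second` site
    with no other site of either kind in between (a double digest)."""
    sites = sorted([(p, "first") for p in find_sites(seq, first)] +
                   [(p, "second") for p in find_sites(seq, second)])
    return [p2 - p1 for (p1, k1), (p2, k2) in zip(sites, sites[1:]) if k1 != k2]


def build_signal_table(database, enzyme_pairs):
    """Table of simulated signals: (pair name, length) -> sequence names."""
    table = {}
    for name, seq in database.items():
        for pair_name, (first, second) in enzyme_pairs.items():
            for length in predicted_lengths(seq, first, second):
                table.setdefault((pair_name, length), []).append(name)
    return table


def candidates(table, pair_name, observed_length, tolerance=0):
    """Database sequences predicted to produce an observed signal."""
    hits = []
    for (pair, length), names in table.items():
        if pair == pair_name and abs(length - observed_length) <= tolerance:
            hits.extend(names)
    return sorted(set(hits))
```

A detected signal is thus looked up by its enzyme pair and its length, and every database sequence predicted to yield a fragment of that length is returned as a candidate; when several sequences are returned, further analysis is needed to decide which one actually causes the signal.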

A first analysis method is selecting a database of DNA sequences representative of the sample to 
be analyzed. In the preferred use of this invention, the DNA sequences to be analyzed will be derived 
from a tissue sample, typically a human sample examined for diagnostic or research purposes. In this 
use, database selection begins with one or more publicly available databases which comprehensively 
record all observed DNA sequences. Such databases are GenBank from the National Center for 
Biotechnology Information (Bethesda, Md.), the EMBL Data Library at the European Bioinformatics 
Institute (Hinxton Hall, UK) and databases from the National Center for Genome Research (Santa Fe, 
N.Mex.). However, as any sample of a plurality of DNA sequences of any provenance can be analyzed 
by the methods of this invention, any database containing entries for the sequences likely to be present in 
such a sample to be analyzed is usable in the further steps of the computer methods. 

A second analysis method uses the previously selected database of sequences likely to be present 
in a sample and a description of an intended experiment to derive a pattern of the signals which will be 
produced by DNA fragments generated in the experiment. This pattern can be stored in a computer 
implementation in any convenient manner. In the following, without limitation, it is described as being 
stored as a table of information. This table may be stored as individual records or by using a database 
system, such as any conventionally available relational database. Alternatively, the pattern may simply 
be stored as the image of the in-memory structures which represent the pattern. 

A second important embodiment of this method, i.e., determining the presence of the fragments 
and their quantitation, as described in step (b) above, is carried out by a process that includes the steps as 
follow. First, for each pair of nucleotide subsequences selected, a pair of oligonucleotide primers is 
provided, the pair consisting of a first primer and a second primer, wherein the first primer is 
complementary to the first subsequence and the second primer is complementary to the second 
subsequence. Second, the nucleotide sequence between the first subsequence and the second 
subsequence is amplified using the oligonucleotide primers to prime the amplification, thereby 
providing an amplicon characterized by the subsequence pair, a length between the two subsequences 
corresponding to the two primers employed in each pair and a quantitation of the extent to which each 
amplicon is present. Third, output signals are generated as above for each amplicon, each output signal 
characterizing (a) the subsequences of the pairs of primers, (b) the length, and (c) the quantitation. 
Optionally, a nucleotide sequence database may be searched for sequences that are predicted to produce, 
or alternatively, not produce, the one or more output signals generated by the nucleic acid from the cells 
of each class, given the parameters described above. Analysis methods are as described above. 

This invention can be applied, for example and not by way of limitation, to in vitro cell 
populations or cell lines, to in vivo animal models of disease or other processes, to human samples, to 
purified cell populations perhaps drawn from actual wild-type occurrences, and to tissue samples 
containing mixed cell populations. The cell or tissue sources can advantageously be a plant, a single 
celled animal, a multicellular animal, a bacterium, a virus, a fungus, or a yeast, etc. The animals can 
advantageously be laboratory animals used in research, such as mice engineered or bred to have certain 
genomes or disease conditions or tendencies. 

Cells used in the invention may be obtained from a mammal, preferably a human, having or 
suspected of having a diseased condition. In one embodiment, the diseased condition is a malignancy. 
The in vitro cell populations or cell lines can be exposed to various exogenous factors to determine the 
effect of such factors on gene expression. In a preferred embodiment, the exogenous factor is a putative 
pharmaceutical agent. Cells so contacted with a putative pharmaceutical agent are treated with an 
amount of the agent sufficient to effect a change in the state of those cells or with an amount of the agent 
less than or equal to a predetermined upper limit of dosing concentration, prior to their being assayed. 
Measures of relatedness and extent of correlation may be made between cells so contacted with putative 
pharmaceutical agent and, for example, cells not so contacted. 

Extent of relatedness methodology 

The present invention provides a representation of the extent of relatedness between a first class 
of cells and a second class of cells. The cells in each class are chosen from among cells of a given cell 
type, cells from a given tissue, and cells from a given organ, as described above. The extent of 
relatedness reflects similarities or differences in the presence of pairs of nucleotide subsequences, each 
pair consisting of a first subsequence and a second subsequence, in a nucleotide length separating the 
first and second subsequences of the pair and in a quantitation of the extent to which each pair having the 
determined length is in the classes of cells. Input information on the fragments to be analyzed is 
obtained by methods of nucleic acid analysis and quantitation as described in the NUCLEIC ACID ASSAYS 
section above. 

The measure of relatedness is provided by calculating a distance that reflects the amplitude of a 
difference vector. A difference vector is defined as a difference between a first vector and a second 
vector. Herein, the first vector reflects information derived from the quantitation for each subsequence 
pair obtained for the first class of cells, and correspondingly, the second vector reflects the analogous 
information derived from the second class. The different elements of each vector relate to data obtained 
using different subsequence pairs. 

In an embodiment of the representation, the extent of relatedness is related to a distance. This 
distance reflects the amplitude of a difference vector that is a difference between a first vector which 
reflects information derived from the quantitation for each subsequence pair obtained for the first class 
and a second vector which reflects the corresponding information obtained for the second class. The 
different elements of each vector relate to data obtained using different subsequence pairs. 

In an additional significant embodiment, the representation includes a tree structure reflecting 
the relatedness between any two classes. The branches of the tree structure reflect the difference vectors 
and are ramified from nodes. 

In important embodiments of the representation of the extent of relatedness, the representation is 
obtained employing the methods of the invention, including the methods that have been summarized in 
the paragraphs immediately above. 

In additional significant embodiments of the representation of the extent of relatedness, the cells 
in at least one class are obtained as described in the NUCLEIC ACID ANALYSIS section above. 

Correlation analysis methodology 

The invention also provides a method for generating a representation of the correlation between 
a first class of cells and a second class of cells. The correlation reflects a change in the nature and 
amount of nucleic acids present in the classes. In this method, the cells in each class are chosen from 
among cells of a given cell type, cells from a given tissue, and cells from a given organ. The methods of 
nucleic acid analysis and quantitation are as described in the NUCLEIC ACID ASSAYS section above. 

Upon generation of a signal output, the cells of the first class and the cells of the second class 
are correlated, and a representation of the correlation is prepared. The fragments are quantitated 
according to the RE pair used in a given run and the length of each fragment so generated, 
thereby providing a quantitative measure of the extent to which the nucleic acid 
present in the cells in each class contains fragments having the specific subsequence pairs and the 
nucleotide length between the pairs. 

In a significant embodiment of the method for generating a representation of the correlation, the 
correlation is related to a set of orthonormal eigenvectors, as described in the DISTANCES section above. 
The elements of the basis set upon which the eigenvectors are constructed reflect particular biochemical 
or physiological pathways correlated between the cells of the two classes. Each of these eigenvectors is 
associated with an eigenvalue that is an integer greater than zero. After defining an upper limit of the 
eigenvalues to be used, the coefficients of the basis set elements in each eigenvector whose eigenvalue is 
less than or equal to this upper limit reflect the contribution of the corresponding pathway to the 
biochemical or physiological differences correlated between the cells of the first class and the cells of the 
second class. 

In another significant embodiment of the method for generating a representation of the 
correlation, the representation is a cluster diagram or a dendrogram, and includes a tree structure 
reflecting the relatedness of the pathways involved in the biochemical or physiological response to a 
difference between cells of the two classes. In obtaining this representation, a correlation matrix is 
calculated that provides a distance determination in which the distance reflects the amplitude of a 
difference vector. This vector is a difference between two vectors each of which reflects information 
obtained for the response of one of the two classes to the difference between the classes, and wherein the 
branches of the tree structure reflect the difference vectors and the branches are ramified from nodes. 

In additional significant embodiments of the representation of the extent of correlation, the cells 
in at least one class are obtained as described in the NUCLEIC ACID ANALYSIS section above. 

Display means 

The present invention also provides a display means displaying a representation of the extent of 
relatedness between a first class of cells and a second class of cells. The cells in each class are chosen 
from among cells of a given cell type, cells from a given tissue, and cells from a given organ, as 
described above. The extent of relatedness reflects similarities or differences in the presence of pairs of 
nucleotide subsequences, each pair consisting of a first subsequence and a second subsequence, in a 
nucleotide length separating the first and second subsequences of the pair and in a quantitation of the 
extent to which each pair having the determined length is in the classes of cells. 

In a significant embodiment of the display means, the extent of relatedness is related to a 
distance. This distance reflects the amplitude of a difference vector that is a difference between a first 
vector which reflects information derived from the quantitation for each subsequence pair obtained for 
the first class and a second vector which reflects the corresponding information obtained for the second 
class. The different elements of each vector relate to data obtained using different subsequence pairs. 

In an additional significant embodiment of the display means, the representation includes a tree 
structure reflecting the relatedness between any two classes, in which the branches of the tree structure 
reflect the difference vectors and the branches are ramified from nodes. 

In important embodiments of the display means displaying a representation of the extent of 
relatedness, the representation is obtained employing the methods of the invention, including the 
methods that have been summarized in the paragraphs immediately above. 

In additional significant embodiments of the display means displaying a representation of the 
extent of relatedness, the cells in at least one class are obtained as described in the NUCLEIC ACID ANALYSIS 
section above. 

The present invention additionally provides a display means displaying a representation of the 
correlation between a first class of cells and a second class of cells. The cells in each class are chosen 
from among cells of a given cell type, cells from a given tissue, and cells from a given organ, as 
described above. The correlation reflects differences between the first class and the second class in the 
presence of a pair of nucleotide subsequences, each pair consisting of a first subsequence and a second 
subsequence and the nucleotide length separating the first and second subsequences of the pair, and in a 
quantitation of the extent to which each pair having the determined length is present in the cells. 

In an advantageous embodiment of this display means, the correlation is related to a set of 
orthonormal eigenvectors. The elements of the basis set upon which the eigenvectors are constructed 
reflect particular biochemical or physiological pathways correlated between the cells of the two classes. 
Each of these eigenvectors is associated with an eigenvalue that is an integer greater than zero. After 
defining an upper limit of the eigenvalues to be used, the coefficients of the basis set elements in each 
eigenvector whose eigenvalue is less than or equal to this upper limit reflect the contribution of the 
corresponding pathway to the biochemical or physiological differences correlated between the cells of 
the first class and the cells of the second class. 

In an additional advantageous embodiment of the display means displaying a representation of 
the correlation, the representation is a cluster diagram or a dendrogram, and includes a tree structure 
reflecting the relatedness of the pathways involved in the biochemical or physiological response to a 
difference between cells of the two classes. In obtaining this representation, a correlation matrix is 
calculated that provides a distance determination in which the distance reflects the amplitude of a 
difference vector. This vector is a difference between two vectors each of which reflects information 
obtained for the response of one of the two classes to the difference between the classes. The branches 
of the tree structure reflect the difference vectors and the branches are ramified from nodes. 

In important embodiments of the display means displaying a representation of the correlation, 
the representation is obtained employing the methods of the invention, including the methods that have 
been summarized in the paragraphs immediately above. 

In important embodiments of the representation of the correlation, the representation is obtained 
employing the methods of the invention, including the methods that have been summarized in the 
paragraphs immediately above. 

In additional significant embodiments of the display means displaying a representation of the 
correlation, the cells in at least one class are obtained as described in the NUCLEIC ACID ANALYSIS section 
above. 

Other Aspects 

In addition to providing representations of cells, the techniques described here are also useful for
providing representations of nucleic acid fragments or genes. The starting point for the analysis is the
matrix $I_{sd}$ described previously, where s labels the sample (or group of samples or distinct types of
cells) and d labels a particular measurement of the expression level of a particular gene in that class.
Rather than generating representations based on the rows of I, each representing a different sample or
group of samples, it is possible to generate representations based on the columns of I, each representing
a different nucleic acid. Hierarchical and geometrical representations of nucleic acids, based on their
relative abundance across a series of cells, can be used to infer genes that are co-expressed and are
likely to have related biological function.
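
The switch from rows to columns can be sketched as follows. This is a minimal illustration that
assumes a complete intensity matrix with no missing entries and uses the Pearson correlation as the
measure of co-expression; the function names and the small matrix are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def column_correlations(matrix):
    """Correlate the columns (nucleic acids) instead of the rows (samples)."""
    columns = list(zip(*matrix))  # transpose: one tuple per nucleic acid
    m = len(columns)
    return [[pearson(columns[i], columns[j]) for j in range(m)] for i in range(m)]

# Hypothetical intensities: 4 samples (rows) x 3 nucleic acids (columns).
intensities = [
    [1.0, 2.0, 9.0],
    [2.0, 4.0, 7.0],
    [3.0, 6.0, 8.0],
    [4.0, 8.0, 3.0],
]
gene_corr = column_correlations(intensities)
# Columns 0 and 1 rise together across the samples; column 2 moves the other way.
```

A pair of columns with a correlation near 1 would be a candidate for co-expression under this
reading; the same matrix could then be passed to any clustering step that accepts pairwise values.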

Other Embodiments 

The data matrix of intensities I can be described more generally as a representation in which
each row corresponds to a particular biological sample or group of samples, and each column
corresponds to a particular nucleic acid molecule or class of molecules whose quantities are measured in
each of the biological states.

In addition to the differential-display methods described to provide measurements of nucleic acid
quantities, other methods for obtaining measurements of the nucleic acids present in a cell are available.
These include restriction fragment length polymorphism, amplification fragment length polymorphism,
EST sequencing, serial analysis of gene expression, hybridization to oligonucleotide probes, and other
methods known in the art. Other methods, such as quantification by TaqMan or Northern blots, are also
used. All of these methods generate data sets that can be analyzed according to the methods described
here. The measurements $I_{sd}$ for each biological state and nucleic acid can correspond to absolute
concentrations, concentrations relative to a standard (either ratio or numeric difference), or other
convenient measures.

The methods of the invention include analysis of populations ranging from 5, 10, 25, 50, 100,
1000, 10,000 or 100,000 or more members.

EXAMPLE

Male Sprague-Dawley rats (Harlan Sprague Dawley, Inc., Indianapolis, Indiana) of 10-14 weeks
of age were gavage-fed and dosed once a day for three days with the following drugs, dissolved in sterile
water, at the following levels:

phenobarbital   3.81 mg/kg/day
gabapentin     34.29 mg/kg/day
vigabatrin    150    mg/kg/day
paraldehyde    77.08 mg/kg/day

These dosages correspond to the ED100 (the upper limit of the effective dose for humans)
adjusted for the difference in metabolic rate between rats and humans. Three rats were used for each
drug treatment, and an additional three rats to match each drug were treated with sterile water to serve as
a control.

Rats were sacrificed 24 hours after the final dose and their brains were harvested. Collection of
mRNA, synthesis of cDNA, and differential display protocols were carried out according to methods
described in U.S. Patent No. 5,871,697 and Shimkets et al. 1999 (Nature Biotechnology 17:798-803).

The following steps were followed to analyze the differential display pattern:

1. The intensities $I(x)$ for each of the three animals treated with the same drug were combined
into a single average $I_a(x)$, where the subscript a labels the drug. The standard deviation $s_a(x)$ was
also computed for the measurements from the individual animals treated with the drug.

2. The averages $I_a(x)$ and standard deviations $s_a(x)$ for each drug were compared with the
average $I_c(x)$ and standard deviation $s_c(x)$ for the sterile water control treatment. A difference at
length x was marked if

$$\left| \ln\left[ I_a(x) / I_c(x) \right] \right| > \ln(1.5) \qquad (5)$$

and if the significance was smaller than 0.15 for a two-tailed t-test with

$$t = \frac{I_a(x) - I_c(x)}{\left[ \left( s_a(x)^2 + s_c(x)^2 \right) / 2 \right]^{1/2}} \qquad (6)$$

and infinite degrees of freedom. The difference intensities marked according to this procedure
may then be inspected by eye and visually significant differences may be retained.

3. For each of the differences d, defined by a restriction enzyme pair r and a position x, the
intensity $I_{ad}$ was determined for each of the drug treatments, whether or not that particular
treatment has a difference compared to the control.
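
The marking rule in step 2 can be sketched as below. The sketch rests on two assumptions that should
be checked against the original equations: equation (6) is read as the difference of the two averages
divided by the square root of the mean of the two variances, and "infinite degrees of freedom" is
taken to mean that t is compared against the standard normal distribution. The function name, the
argument names, and the sample numbers are hypothetical.

```python
import math

def is_marked(mean_drug, sd_drug, mean_ctrl, sd_ctrl, fold=1.5, alpha=0.15):
    """Return True if a peak passes both the fold-change and significance tests."""
    if mean_drug <= 0 or mean_ctrl <= 0:
        return False  # assumption: the log ratio is undefined, so skip the peak

    # Condition (5): |ln(I_a / I_c)| > ln(1.5)
    fold_ok = abs(math.log(mean_drug / mean_ctrl)) > math.log(fold)

    # Statistic (6), as read here: difference over root-mean of the variances
    spread = math.sqrt((sd_drug ** 2 + sd_ctrl ** 2) / 2.0)
    if spread == 0.0:
        return fold_ok  # no scatter at all: any difference is significant

    t = (mean_drug - mean_ctrl) / spread
    # Infinite degrees of freedom: two-tailed p-value from the normal distribution
    p_value = math.erfc(abs(t) / math.sqrt(2.0))
    return fold_ok and p_value < alpha

# Hypothetical peaks: (drug mean, drug sd, control mean, control sd)
strong_change = is_marked(200.0, 20.0, 100.0, 15.0)   # large ratio, small scatter
small_ratio = is_marked(120.0, 20.0, 100.0, 15.0)     # ratio below 1.5
noisy_change = is_marked(200.0, 150.0, 100.0, 100.0)  # large ratio, large scatter
```

Both conditions must hold, so a peak with a twofold change is still rejected when the scatter among
animals is large.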

In this example, the final data matrix $I_{ad}$ has 8 rows: 1 row for each of the 4 drugs, and 1 row
for each of 4 replicates of the water control data. The matrix has as many columns as the number of
differences detected in the differential display pattern.

The Pearson correlation coefficient $C_{ab}$ between the 8 classes of samples (4 drugs, 4 water
controls) was determined using methods provided in the Detailed Description of the invention. If a data
element for a particular difference was missing for a particular treatment, that difference did not
contribute to the correlation coefficient. The correlations are shown in Table 1 below, with the
standard deviation within a drug shown as the diagonal elements.
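
One way to read the missing-data rule is pairwise deletion: a difference contributes to the
coefficient for a given pair of treatments only if both treatments have a value for it. The sketch
below follows that reading. Computing the means over the shared elements only is an assumption, as
is the use of `None` to mark a missing element; the exact formula is the one given in the Detailed
Description.

```python
import math

def pearson_pairwise(x, y):
    """Pearson correlation using only positions where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two shared measurements")
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical rows of the data matrix; the third difference is missing in row_a.
row_a = [1.0, 2.0, None, 4.0]
row_b = [2.0, 4.0, 100.0, 8.0]
c_ab = pearson_pairwise(row_a, row_b)  # the third column is ignored
```

Because each pair of rows may use a different subset of columns, a matrix assembled this way is not
guaranteed to be positive semidefinite, which is consistent with the negative eigenvalues discussed
after Table 3.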

Table 1.

                  vigabatrin phenobarbital    gabapentin   paraldehyde       water_1       water_2       water_3       water_4
vigabatrin           26.1350        0.9982        0.9913        0.9728        0.6548        0.6673        0.3674        0.3785
phenobarbital         0.9982      263.1469        0.9973        0.9325        0.5678        0.4724        0.6569        0.6601
gabapentin            0.9913        0.9973      556.81?1        0.9922        0.6177        0.6057        0.3400        0.2704
paraldehyde           0.9728        0.9325        0.9922      423.1916        0.6486        0.5307        0.5952        0.5702
water_1               0.6548        0.5678        0.6177        0.6486       59.4573        0.9718        0.9895        0.9735
water_2               0.6673        0.4724        0.6057        0.5307        0.9718       62.3568        0.9960        0.9886
water_3               0.3674        0.6569        0.3400        0.5952        0.9895        0.9960      107.2630        0.9978
water_4               0.3785        0.6601        0.2704        0.5702        0.9735        0.9886        0.9978      123.0743


Next, the pairwise Pearson distance was calculated as described previously. The distance is shown
in Table 2 below.

Table 2.

                  vigabatrin phenobarbital    gabapentin   paraldehyde       water_1       water_2       water_3       water_4
vigabatrin            0.0000        0.0597        0.1319        0.2333        0.8309        0.8157        1.1248        1.1149
phenobarbital         0.0597        0.0000        0.0741        0.3675        0.9297        1.0272        0.8284        0.8245
gabapentin            0.1319        0.0741        0.0000        0.1245        0.8744        0.8880        1.1489        1.2080
paraldehyde           0.2333        0.3675        0.1245        0.0000        0.8384        0.9688        0.8997        0.9271
water_1               0.8309        0.9297        0.8744        0.8384        0.0000        0.2376        0.1446        0.2302
water_2               0.8157        1.0272        0.8880        0.9688        0.2376        0.0000        0.0895        0.1509
water_3               1.1248        0.8284        1.1489        0.8997        0.1446        0.0895        0.0000        0.0663
water_4               1.1149        0.8245        1.2080        0.9271        0.2302        0.1509        0.0663        0.0000
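
The conversion from correlation to distance is defined earlier in the specification and is not
repeated in this example. The sketch below assumes the common form d = sqrt(2(1 - C)); this is an
assumption, although it agrees with legible entries of the two tables (for instance the
phenobarbital/water_2 correlation of 0.4724 and distance of 1.0272). The function name is
hypothetical.

```python
import math

def pearson_distance(c):
    """Distance from a correlation coefficient, assuming d = sqrt(2 * (1 - c))."""
    # Clamp at zero so rounding noise around c = 1 cannot produce a negative root.
    return math.sqrt(max(0.0, 2.0 * (1.0 - c)))

d_pheno_water2 = round(pearson_distance(0.4724), 4)
d_gaba_water3 = round(pearson_distance(0.3400), 4)
d_identical = pearson_distance(1.0)
```

Under this form, perfectly correlated classes are at distance 0 and uncorrelated classes are at
distance sqrt(2), so the drug-to-water distances near 1 correspond to correlations near 0.5.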



The distances were then used as input to a nearest-neighbor clustering algorithm. The resulting
clusters, using sterile H2O as an outgroup, are shown in Fig. 3. The horizontal distances in Fig. 3A are
proportional to the pairwise Pearson distance between clusters.
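
A nearest-neighbor clustering step of this kind can be sketched as single-linkage agglomerative
clustering, in which the distance between two clusters is the smallest distance between any of their
members. This is an illustrative reading, not necessarily the algorithm used to produce Fig. 3; the
outgroup handling is omitted, and the function name, labels, and distances are hypothetical.

```python
def single_linkage(names, dist):
    """Single-linkage clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(len(names))]  # start with one cluster per item
    merges = []
    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                # Nearest-neighbor rule: closest pair of members decides the distance
                d = min(dist[i][j] for i in clusters[p] for j in clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        merges.append((tuple(names[i] for i in clusters[p]),
                       tuple(names[i] for i in clusters[q]), d))
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return merges

# Hypothetical distances between two treated and two control samples.
labels = ["drug_1", "drug_2", "water_1", "water_2"]
distances = [
    [0.00, 0.10, 0.90, 1.00],
    [0.10, 0.00, 0.80, 0.95],
    [0.90, 0.80, 0.00, 0.20],
    [1.00, 0.95, 0.20, 0.00],
]
history = single_linkage(labels, distances)
```

Each entry of the merge history corresponds to one node of the tree, and the recorded distance can
be used as the horizontal position of that node in a dendrogram.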

The correlation matrix $C_{ab}$ also served as the starting point for principal factor analysis. First,
principal components were calculated using the inner product matrix from multidimensional scaling

$$B = H C H \qquad (2)$$

where C is the correlation matrix and H is the centering matrix. The k-th principal component is
then the k-th eigenvector of B normalized to unit length and ordered by decreasing eigenvalue
$\lambda_k$, and the k-th principal factor was obtained by scaling the eigenvector by $\lambda_k^{1/2}$.
Projections of the treatments and controls onto principal factors are shown in Table 3 below.

Table 3.

factor               1       2       3       4       5       6       7       8
eigenvalue       1.841   0.347   0.093   0.036   0.00?   0.000  -0.036  -0.389
vigabatrin      -0.513  -0.??3  -0.099   0.083   0.004   0.000   0.000   0.000
phenobarbital   -0.397   0.3?7  -0.?02  -0.038  -0.0??   0.000   0.000   0.000
gabapentin      -0.580  -0.157   0.028  -0.094  -0.004   0.000   0.000   0.000
paraldehyde     -0.404   0.27?   0.?9?   0.057   0.034   0.000   0.000   0.000
water_1          0.363  -0.?69   0.109   0.0?7  -0.060   0.000   0.000   0.000
water_2          0.422  -0.300  -0.091  -0.0??   0.025   0.000   0.000   0.000
water_3          0.542   0.155   0.04?  -0.090   0.0?2   0.000   0.000   0.000
water_4          0.561   0.190  -0.05?   0.082   0.001   0.000   0.000   0.000



The components are ordered from 1 (most informative) to 8 (least informative). The negative
eigenvalues arise from the method used to account for missing data. If missing data had been handled in
an alternate manner, for example if a missing element had been set to the average value or if the analysis
were restricted to differences for which no data was missing, the eigenvalues would all be non-negative.

In Fig. 4, the treatments are displayed by projection onto principal factors. Factor 1
discriminates between drugs, where it has a negative value, and controls, where it has a positive value.
Factor 2 discriminates between the drug treatments.
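
The principal factor calculation can be sketched in two steps: double centering of the correlation
matrix, then extraction of an eigenvector scaled by the square root of its eigenvalue. The sketch
below obtains only the leading factor, using power iteration as a stand-in for a full eigenvalue
routine; it assumes that the eigenvalue of largest magnitude is positive. The function names and the
3 x 3 matrix are hypothetical.

```python
import math

def double_center(c):
    """Return B = H C H, where H = I - (1/n) * ones is the centering matrix."""
    n = len(c)
    row = [sum(c[i]) / n for i in range(n)]
    col = [sum(c[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row) / n
    return [[c[i][j] - row[i] - col[j] + grand for j in range(n)] for i in range(n)]

def leading_factor(b, iterations=200):
    """Leading eigenvalue of B and its eigenvector scaled by sqrt(eigenvalue)."""
    n = len(b)
    v = [float(i + 1) for i in range(n)]  # non-constant start; constants map to zero
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("start vector lies in the null space of B")
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
    if eigenvalue <= 0.0:
        raise ValueError("leading eigenvalue is not positive")
    return eigenvalue, [math.sqrt(eigenvalue) * x for x in v]

# Hypothetical correlations: classes 0 and 1 resemble each other, class 2 differs.
corr = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.2],
    [0.2, 0.2, 1.0],
]
b_matrix = double_center(corr)
lam, factor = leading_factor(b_matrix)
```

In this toy case the leading factor has the same sign for classes 0 and 1 and the opposite sign for
class 2, which mirrors the way Factor 1 separates drugs from controls. The squared entries of a
factor sum to its eigenvalue, and the entries sum to zero because of the centering.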

EQUIVALENTS 

From the foregoing detailed description of the specific embodiments of the invention, it should
be apparent that unique methods for representing the extent of relatedness between cells, cell lines,
tissues, organs, or expressed sequences based on a genomic analysis of gene expression have been
described. Although particular embodiments have been disclosed herein in detail, this has been done by
way of example for purposes of illustration only, and is not intended to be limiting with respect to the
scope of the appended claims which follow. In particular, it is contemplated by the inventor that various
substitutions, alterations, and modifications may be made to the invention without departing from the
spirit and scope of the invention as defined by the claims. For instance, the choice of source material,
subsequences used, or software algorithm used is believed to be a matter of routine for a person of
ordinary skill in the art with knowledge of the embodiments described herein.

CLAIMS

I claim:

1. A method for generating a representation of the extent of relatedness between at least two
classes of cells, wherein the cells in each class are chosen from the group consisting of cells of a given
cell type, cells from a given tissue, and cells from a given organ, the method comprising the steps of:

a) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first
subsequence and a second subsequence;

b) in the nucleic acid of each class of cells determining the presence of a fragment with the first
subsequence at one end and the second subsequence at another end and having a length separated by the
first and second subsequences, and a quantitation of the extent to which each fragment is present; and

c) determining the extent of relatedness reflecting similarities or differences in the presence and
quantitation of the fragments among the classes.

2. The method described in claim 1 wherein the determining of the presence and quantitation of
the fragments described in step b) is carried out by a process comprising the steps of:

i) digesting samples of the nucleic acid from the cells of each class with a plurality of specific
pairs of restriction endonucleases, each sample being treated by one pair, one nuclease of the pair
targeting the first subsequence and the second nuclease of the pair targeting the second subsequence,
each digestion providing specific restriction fragments, hybridizing double stranded adapter DNA
molecules to the fragments, each adapter DNA molecule comprising (a) a shorter strand having no 5'
terminal phosphate and consisting of a first and second portion, said first portion being at the 5' end and
being complementary to the overhang produced by one of the restriction endonucleases of the pair, and
(b) a longer strand having a 3' end complementary to the second portion of the shorter strand, and
ligating the longer strands to the fragments to produce ligated fragments, wherein each ligated fragment
is capable of generating an output signal;

ii) generating output signals from each ligated fragment for each of the pairs of restriction
endonucleases, each output signal characterizing (a) the subsequences of the pairs of restriction
endonucleases, (b) the length between the two subsequences corresponding to the two restriction
endonucleases employed in each pair of nucleases, and (c) the quantitation of the fragment
corresponding to the pair and the length; and

iii) optionally searching a nucleotide sequence database to determine sequences that are
predicted to produce or the absence of any sequences that are predicted to produce the one or more
output signals produced by the nucleic acid from the cells of each class, the database comprising a
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a
sequence from the database being predicted to produce the one or more output signals when the sequence
from the database has both (a) the same length between occurrences of target nucleotide subsequences as
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are
represented by said one or more output signals, or target nucleotide subsequences that are members of
the same sets of target nucleotide subsequences represented by the one or more output signals,

thereby providing a quantitative measure of the extent to which the nucleic acid present in the
cells in each class contains fragments having the specific subsequence pairs and the nucleotide length
between the pairs.

3. The method described in claim 1 wherein the determining of the presence of the fragments and
the quantitation of the fragments described in step b) is carried out by a process comprising the steps of:

i) for each pair of nucleotide subsequences providing a pair of oligonucleotide primers,
consisting of a first primer and a second primer, wherein the first primer is complementary to the first
subsequence and the second primer is complementary to the second subsequence;

ii) amplifying the nucleotide sequence between the first subsequence and the second
subsequence using the oligonucleotide primers to prime the amplification, providing an amplicon
characterized by the subsequence pair, a length between the two subsequences corresponding to the two
primers employed in each pair and a quantitation of the extent to which each amplicon is present;

iii) generating output signals for each amplicon, each output signal characterizing (a) the
subsequences of the pairs of primers, (b) the length, and (c) the quantitation; and

iv) optionally searching a nucleotide sequence database to determine sequences that are
predicted to produce or the absence of any sequences that are predicted to produce the one or more
output signals produced by the nucleic acid from the cells of each class, the database comprising a
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a
sequence from the database being predicted to produce the one or more output signals when the sequence
from the database has both (a) the same length between occurrences of target nucleotide subsequences as
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are
represented by said one or more output signals, or target nucleotide subsequences that are members of
the same sets of target nucleotide subsequences represented by the one or more output signals,

thereby providing a quantitative measure of the extent to which the nucleic acid present in the
cells in each class contains the specific subsequence pairs and the nucleotide length between the pairs.

4. The method described in claim 1 wherein the extent of relatedness in step c) is provided by
calculating a distance wherein the distance reflects the amplitude of a difference vector that is a
difference between a first vector which reflects information derived from the quantitation for each
subsequence pair obtained for the first class and a second vector which reflects information derived from
the quantitation for each subsequence pair obtained for the second class, wherein different elements of
each vector relate to data obtained using different pairs.

5. The method described in claim 1 wherein the extent of relatedness in step c) is provided by
generating a tree structure reflecting the relatedness between any two classes, wherein the branches of
the tree structure reflect the difference vectors and the branches are ramified from nodes.

6. The method described in claim 1 wherein the cells in at least one class are cancer cells.

7. The method described in claim 1 wherein the cells in at least one class have been contacted
with a putative pharmaceutical agent.

8. A method for generating a representation of the correlation between a plurality of classes of
cells wherein the cells in each class are chosen from the group consisting of cells of a given cell type,
cells from a given tissue, and cells from a given organ, the correlation reflecting a change in the nature
and amount of nucleic acids present in the classes, the method comprising the steps of:

a) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first
subsequence and a second subsequence;

b) in the nucleic acid of each class of cells determining the presence of a fragment with the first
subsequence at one end and the second subsequence at another end and having a length separated by the
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby
defining a difference between the classes;

c) evaluating the correlation between the cells of the classes; and

d) preparing a representation of the correlation.

9. The method described in claim 8 wherein the determining of the presence and quantitation of
the fragments described in step b) is carried out by a process comprising the steps of:

i) digesting samples of the nucleic acid from the cells of each class with a plurality of specific
pairs of restriction endonucleases, each sample being treated by one pair, one nuclease of the pair
targeting the first subsequence and the second nuclease of the pair targeting the second subsequence,
each digestion providing specific restriction fragments, hybridizing double stranded adapter DNA
molecules to the fragments, each adapter DNA molecule comprising (a) a shorter strand having no 5'
terminal phosphate and consisting of a first and second portion, said first portion being at the 5' end and
being complementary to the overhang produced by one of the restriction endonucleases of the pair, and
(b) a longer strand having a 3' end complementary to the second portion of the shorter strand, and
ligating the longer strands to the fragments to produce ligated fragments, wherein each ligated fragment
is capable of generating an output signal;

ii) generating output signals from each ligated fragment for each of the pairs of restriction
endonucleases, each output signal characterizing (a) the subsequences of the pairs of restriction
endonucleases, (b) the length between the two subsequences corresponding to the two restriction
endonucleases employed in each pair of nucleases, and (c) the quantitation of the fragment
corresponding to the pair and the length; and

iii) optionally searching a nucleotide sequence database to determine sequences that are
predicted to produce or the absence of any sequences that are predicted to produce the one or more
output signals produced by the nucleic acid from the cells of each class, the database comprising a
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a
sequence from the database being predicted to produce the one or more output signals when the sequence
from the database has both (a) the same length between occurrences of target nucleotide subsequences as
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are
represented by said one or more output signals, or target nucleotide subsequences that are members of
the same sets of target nucleotide subsequences represented by the one or more output signals,

thereby providing a quantitative measure of the extent to which the nucleic acid present in the
cells in each class contains fragments having the specific subsequence pairs and the nucleotide length
between the pairs.

10. The method described in claim 8 wherein the determining of the presence of the fragments
and the quantitation of the fragments described in step b) is carried out by a process comprising the steps
of:

i) for each pair of nucleotide subsequences providing a pair of oligonucleotide primers,
consisting of a first primer and a second primer, wherein the first primer is complementary to the first
subsequence and the second primer is complementary to the second subsequence;

ii) amplifying the nucleotide sequence between the first subsequence and the second
subsequence using the oligonucleotide primers to prime the amplification, providing an amplicon
characterized by the subsequence pair, a length between the two subsequences corresponding to the two
primers employed in each pair and a quantitation of the extent to which each amplicon is present;

iii) generating output signals for each amplicon, each output signal characterizing (a) the
subsequences of the pairs of primers, (b) the length, and (c) the quantitation; and

iv) optionally searching a nucleotide sequence database to determine sequences that are
predicted to produce or the absence of any sequences that are predicted to produce the one or more
output signals produced by the nucleic acid from the cells of each class, the database comprising a
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a
sequence from the database being predicted to produce the one or more output signals when the sequence
from the database has both (a) the same length between occurrences of target nucleotide subsequences as
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are
represented by said one or more output signals, or target nucleotide subsequences that are members of
the same sets of target nucleotide subsequences represented by the one or more output signals,

thereby providing a quantitative measure of the extent to which the nucleic acid present in the
cells in each class contains the specific subsequence pairs and the nucleotide length between the pairs.

11. The method described in claim 8 wherein the correlation in step d) is related to a set of
orthonormal eigenvectors, the elements of the basis set upon which the eigenvectors are constructed
reflecting particular biochemical or physiological pathways correlated between the cells of the two
classes, each eigenvector having an eigenvalue that is an integer greater than zero, the coefficients of the
basis set elements in each eigenvector whose eigenvalue is less than or equal to a particular integer that
is an upper limit of the eigenvalues used reflecting the contribution of the corresponding pathway to the
biochemical or physiological differences correlated between the cells of the first class and the cells of the
second class.



wo 00/15851 - 32 - PCT/US99/21525 

12. The method described in claim 8 wherein the representation is a cluster diagram or a
dendrogram, and includes a tree structure reflecting the relatedness of the pathways involved in the
biochemical or physiological response to a difference between cells of the two classes, wherein a
correlation matrix provides a distance determination wherein the distance reflects the amplitude of a
difference vector that is a difference between two vectors each of which reflects information obtained for
the response of one of the two classes to the difference, and wherein the branches of the tree structure
reflect the difference vectors and the branches are ramified from nodes.

13. The method described in claim 8 wherein the cells in at least one class are cancer cells. 

14. The method described in claim 8 wherein the cells in at least one class have been contacted 
with a putative pharmaceutical agent, and the method comprises the steps of 

a) treating the cells of at least one class with an amount of the agent sufficient to effect a change
in the state of those cells or with an amount of the agent less than or equal to a predetermined upper limit
of dosing concentration;

b) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first
subsequence and a second subsequence;

c) in the nucleic acid of each class of cells determining the presence of a fragment with the first
subsequence at one end and the second subsequence at another end and having a length separated by the
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby
defining an effect of the agent;

d) evaluating the correlation between the effect of the agent on the cells of the first class and the
effect of the agent on the cells of another class; and

e) preparing a representation of the correlation.

15. A display means displaying a representation of the extent of relatedness between at least two
classes of cells, wherein the cells in each class are chosen from the group consisting of cells of a given
cell type, cells from a given tissue, and cells from a given organ, the extent of relatedness reflecting, in
the nucleic acids of the classes of cells, similarities or differences in the presence of pairs of nucleotide
subsequences, each pair consisting of a first subsequence and a second subsequence, a nucleotide length
separating the first and second subsequences of the pair and a quantitation of the extent to which each
pair having the determined length is in the classes of cells.

16. The display means described in claim 15 wherein the extent of relatedness is related to a
distance wherein the distance reflects the amplitude of a difference vector that is a difference between a
first vector which reflects information derived from the quantitation for each subsequence pair obtained
for the first class and a second vector which reflects information derived from the quantitation for each
subsequence pair obtained for the second class, wherein different elements of each vector relate to data
obtained using different pairs.

17. The display means described in claim 15 wherein the representation includes a tree structure 
reflecting the relatedness between any two classes, and wherein the branches of the tree structure reflect 
the difference vectors and the branches are ramified from nodes. 

18. The display means described in claim 15 wherein the extent of relatedness is obtained by a
process comprising the steps of:

a) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first
subsequence and a second subsequence;

b) in the nucleic acid of each class of cells determining the presence of a fragment with the first
subsequence at one end and the second subsequence at another end and having a length separated by the
first and second subsequences, and a quantitation of the extent to which each fragment is present; and

c) determining the extent of relatedness reflecting similarities or differences in the presence and
quantitation of the fragments among the classes.

19. The display means described in claim 18 wherein the determining of the presence and
quantitation of the fragments described in step b) is carried out by a process comprising the steps of:

i) digesting samples of the nucleic acid from the cells of each class with a plurality of specific
pairs of restriction endonucleases, each sample being treated by one pair, one nuclease of the pair
targeting the first subsequence and the second nuclease of the pair targeting the second subsequence,
each digestion providing specific restriction fragments, hybridizing double stranded adapter DNA
molecules to the fragments, each adapter DNA molecule comprising (a) a shorter strand having no 5'
terminal phosphate and consisting of a first and second portion, said first portion being at the 5' end and
being complementary to the overhang produced by one of the restriction endonucleases of the pair, and
(b) a longer strand having a 3' end complementary to the second portion of the shorter strand, and
ligating the longer strands to the fragments to produce ligated fragments, wherein each ligated fragment
is capable of generating an output signal;

ii) generating output signals from each ligated fragment for each of the pairs of restriction
endonucleases, each output signal characterizing (a) the subsequences of the pairs of restriction
endonucleases, (b) the length between the two subsequences corresponding to the two restriction
endonucleases employed in each pair of nucleases, and (c) the quantitation of the fragment
corresponding to the pair and the length; and

iii) optionally searching a nucleotide sequence database to determine sequences that are
predicted to produce or the absence of any sequences that are predicted to produce the one or more
output signals produced by the nucleic acid from the cells of each class, the database comprising a
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a
sequence from the database being predicted to produce the one or more output signals when the sequence
from the database has both (a) the same length between occurrences of target nucleotide subsequences as
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are
represented by said one or more output signals, or target nucleotide subsequences that are members of
the same sets of target nucleotide subsequences represented by the one or more output signals,

thereby providing a quantitative measure of the extent to which the nucleic acid present in the
cells in each class contains fragments having the specific subsequence pairs and the nucleotide length
between the pairs.

20. The display means described in claim 18 wherein the determining of the presence of the 
fragments and the quantitation of the fragments, described in step b) is carried out by a process 
comprising the steps of: 

i) for each pair of nucleotide subsequences providing a pair of oligonucleotide primers, 
consisting of a first primer and a second primer, wherein the first primer is complementary to the first 
subsequence and the second primer is complementary to the second subsequence; 

ii) amplifying the nucleotide sequence between the first subsequence and the second 
subsequence using the oligonucleotide primers to prime the amplification, providing an amplicon 
characterized by the subsequence pair, a length between the two subsequences corresponding to 
the two primers employed in each pair and a quantitation of the extent to which each amplicon is 
present; and 

iii) generating output signals for each amplicon, each output signal characterizing (a) the 
subsequences of the pairs of primers, (b) the length, and (c) the quantitation; and 

iv) optionally searching a nucleotide sequence database to determine sequences that are 
predicted to produce or the absence of any sequences that are predicted to produce the one or more 
output signals produced by the nucleic acid from the cells of each class, the database comprising a 
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a 
sequence from the database being predicted to produce the one or more output signals when the sequence 
from the database has both (a) the same length between occurrences of target nucleotide subsequences as 
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are 
represented by said one or more output signals, or target nucleotide subsequences that are members of 
the same sets of target nucleotide subsequences represented by the one or more output signals, 

thereby providing a quantitative measure of the extent to which the nucleic acid present in the 
cells in each class contains the specific subsequence pairs and the nucleotide length between the pairs. 

21. The display means described in claim 15 wherein the cells in at least one class are cancer 
cells. 

22. The display means described in claim 15 wherein the cells in at least one class have been 
contacted with a putative pharmaceutical agent. 

23. A display means displaying a representation of the correlation between a plurality of classes 
of cells, wherein the cells in each class are chosen from the group consisting of cells of a given cell type, 
cells from a given tissue, and cells from a given organ, the correlation reflecting, in the nucleic acids of 
the classes of cells, differences in the presence of a pair of nucleotide subsequences, each pair consisting 
of a first subsequence and a second subsequence and the nucleotide length separating the first and second 
subsequences of the pair, and a quantitation of the extent to which each pair having the determined 
length is present in the cells, between the classes. 

24. The display means described in claim 23 wherein the correlation is related to a set of 
orthonormal eigenvectors, the elements of the basis set upon which the eigenvectors are constructed 
reflecting particular biochemical or physiological pathways correlated between the cells of the two 
classes, each eigenvector having an eigenvalue that is an integer greater than zero, the coefficients of the 
basis set elements in each eigenvector whose eigenvalue is less than a particular integer that is chosen to 
be an upper limit of the eigenvalues reflecting the contribution of the corresponding pathway to the 
biochemical or physiological differences correlated between the cells of the first class and the cells of the 
second class. 

25. The display means described in claim 23 wherein the representation is a cluster diagram or a 
dendrogram and includes a tree structure reflecting the relatedness of the pathways involved in the 
biochemical or physiological difference between cells of the two classes, wherein a correlation matrix 
provides a distance determination wherein the distance reflects the amplitude of a difference vector that 
is a difference between two vectors each of which reflects information obtained for the difference 
between the classes, and wherein the branches of the tree structure reflect the difference vectors and the 
branches are ramified from nodes. 
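
A tree structure of the kind recited in claim 25 could, for example, be built by agglomerative clustering of quantitation vectors. The sketch below is only an illustration under stated assumptions: it uses single linkage and Euclidean distance, neither of which is prescribed by the claim, and the class labels and profile values are hypothetical.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Amplitude of the difference vector between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def leaves(tree):
    """Flatten a nested-tuple tree into its list of leaf labels (strings)."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]


def build_tree(profiles):
    """Single-linkage agglomerative clustering.

    profiles maps a class label (a string) to its quantitation vector.
    Returns a nested tuple; each tuple is a node from which two branches ramify.
    """
    clusters = list(profiles)  # every class starts as its own cluster

    def linkage(a, b):
        # Single linkage: smallest distance between any two members.
        return min(euclidean(profiles[x], profiles[y])
                   for x in leaves(a) for y in leaves(b))

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = (clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]


# Hypothetical profiles: two similar classes and one distant class.
profiles = {
    "liver": [1.0, 0.0],
    "kidney": [1.1, 0.0],
    "tumor": [5.0, 5.0],
}
tree = build_tree(profiles)
```

With these values the two similar classes are joined first, and the distant class attaches at the root node.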

26. The display means described in claim 23 wherein the correlation is obtained by a method 
comprising the steps of: 

a) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first 
subsequence and a second subsequence; 

b) in the nucleic acid of each class of cells determining the presence of a fragment with the first 
subsequence at one end and the second subsequence at another end and having a length separated by the 
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby 
defining a difference between classes; 

c) evaluating the correlation between the cells of one class and the cells of a second class based 
on the difference between them; and 

d) preparing a representation of the correlation. 

27. The display means described in claim 23 wherein the determining of the presence and 
quantitation of the fragments described in step b) is carried out by a process comprising the steps of: 

i) digesting samples of the nucleic acid from the cells of each class with a plurality of specific 
pairs of restriction endonucleases, each sample being treated by one pair, one nuclease of the pair 
targeting the first subsequence and the second nuclease of the pair targeting the second subsequence, 
each digestion providing specific restriction fragments, hybridizing double stranded adapter DNA 
molecules to the fragments, each adapter DNA molecule comprising (a) a shorter strand having no 5' 
terminal phosphate and consisting of a first and second portion, said first portion being at the 5' end and 
being complementary to the overhang produced by one of the restriction endonucleases of the pair, and 
(b) a longer strand having a 3' end complementary to the second portion of the shorter strand, and 
ligating the longer strands to the fragments to produce ligated fragments, wherein each ligated fragment 
is capable of generating an output signal; 

ii) generating output signals from each ligated fragment for each of the pairs of restriction 
endonucleases, each output signal characterizing (a) the subsequences of the pairs of restriction 
endonucleases (b) the length between the two subsequences corresponding to the two restriction 
endonucleases employed in each pair of nucleases, and (c) the quantitation of the fragment 
corresponding to the pair and the length; and 

iii) optionally searching a nucleotide sequence database to determine sequences that are 
predicted to produce or the absence of any sequences that are predicted to produce the one or more 
output signals produced by the nucleic acid from the cells of each class, the database comprising a 
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a 
sequence from the database being predicted to produce the one or more output signals when the sequence 
from the database has both (a) the same length between occurrences of target nucleotide subsequences as 
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are 
represented by said one or more output signals, or target nucleotide subsequences that are members of 
the same sets of target nucleotide subsequences represented by the one or more output signals, 

thereby providing a quantitative measure of the extent to which the nucleic acid present in the 
cells in each class contains fragments having the specific subsequence pairs and the nucleotide length 
between the pairs. 

28. The display means described in claim 23 wherein the determining of the presence of the 
fragments and the quantitation of the fragments, described in step b) is carried out by a process 
comprising the steps of: 

i) for each pair of nucleotide subsequences providing a pair of oligonucleotide primers, 
consisting of a first primer and a second primer, wherein the first primer is complementary to the first 
subsequence and the second primer is complementary to the second subsequence; 

ii) amplifying the nucleotide sequence between the first subsequence and the second 
subsequence using the oligonucleotide primers to prime the amplification, providing an amplicon 
characterized by the subsequence pair, a length between the two subsequences corresponding to the two 
primers employed in each pair and a quantitation of the extent to which each amplicon is present; and 

iii) generating output signals for each amplicon, each output signal characterizing (a) the 
subsequences of the pairs of primers, (b) the length, and (c) the quantitation; and 

iv) optionally searching a nucleotide sequence database to determine sequences that are 
predicted to produce or the absence of any sequences that are predicted to produce the one or more 
output signals produced by the nucleic acid from the cells of each class, the database comprising a 
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a 
sequence from the database being predicted to produce the one or more output signals when the sequence 
from the database has both (a) the same length between occurrences of target nucleotide subsequences as 
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are 
represented by said one or more output signals, or target nucleotide subsequences that are members of 
the same sets of target nucleotide subsequences represented by the one or more output signals, 

thereby providing a quantitative measure of the extent to which the nucleic acid present in the 
cells in each class contains the specific subsequence pairs and the nucleotide length between the pairs. 

29. The display means described in claim 23 wherein the cells in at least one class are cancer 
cells. 

30. The display means described in claim 23 wherein the cells in at least one class have been 
contacted with a putative pharmaceutical agent, and the correlation is obtained by a method comprising the 
steps of: 

a) contacting the cells of at least one class with an amount of the agent sufficient to effect a 
change in the state of those cells or with an amount of the agent less than or equal to a predetermined 
upper limit of dosing concentration; 

b) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first 
subsequence and second subsequence; 

c) in the nucleic acid of each class of cells determining the presence of a fragment with the first 
subsequence at one end and the second subsequence at another end and having a length separated by the 
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby 
defining an effect of the agent; 

d) evaluating the correlation between the effect of the agent between the cells of at least one 
class contacted with the agent and the cells of another class; and 

e) preparing a representation of the correlation. 
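
Step d) does not fix a particular correlation measure. As one possible reading, the sketch below computes a Pearson correlation coefficient between the fragment quantitations of a treated class and those of another class. The choice of Pearson correlation, the function name, and the sample values are assumptions for illustration only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical quantitations; each position is one (subsequence pair, length).
treated = [1.0, 2.0, 3.0]
untreated = [2.0, 4.0, 6.0]
r = pearson(treated, untreated)
```

A coefficient near 1 would indicate that the agent left the relative fragment abundances largely unchanged, while values near 0 or below would indicate a marked effect.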



31. A representation of the extent of relatedness between at least two classes of cells, wherein the 
cells in each class are chosen from the group consisting of cells of a given cell type, cells from a given 
tissue, and cells from a given organ, the extent of relatedness reflecting, in the nucleic acids of the 
classes of cells, similarities or differences in the presence of pairs of nucleotide subsequences, each pair 
consisting of a first subsequence and a second subsequence, a nucleotide length separating the first and 
second subsequences of the pair and a quantitation of the extent to which each pair having the 
determined length is in the classes of cells. 

32. The representation described in claim 31 wherein the extent of relatedness is related to a 
distance wherein the distance reflects the amplitude of a difference vector that is a difference between a 
first vector which reflects information derived from the quantitation for each subsequence pair obtained 
for the first class and a second vector which reflects information derived from the quantitation for each 
subsequence pair obtained for the second class, wherein different elements of each vector relate to data 
obtained using different pairs. 
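
The distance recited in claim 32 can be sketched as the Euclidean norm of the difference between two quantitation vectors. The claim speaks only of the "amplitude" of the difference vector, so the Euclidean norm is an assumption here, as are the function name, the dictionary layout, and the treatment of a missing fragment as a quantitation of zero.

```python
import math


def relatedness_distance(first_class, second_class):
    """Amplitude of the difference vector between two classes.

    Each argument maps a key (first subsequence, second subsequence, length)
    to the quantitation observed for that fragment. A fragment missing from
    one class is treated as having quantitation 0.0 (an assumption).
    """
    keys = sorted(set(first_class) | set(second_class))
    diff = [first_class.get(k, 0.0) - second_class.get(k, 0.0) for k in keys]
    return math.sqrt(sum(d * d for d in diff))


# Hypothetical data: two subsequence pairs, each giving one vector element.
class_one = {("GAATTC", "GGATCC", 120): 3.0}
class_two = {("AAGCTT", "GGATCC", 250): 4.0}
distance = relatedness_distance(class_one, class_two)
```

A distance of zero means the two classes gave identical quantitations for every pair considered.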

33. The representation described in claim 31 wherein the representation includes a tree structure 
reflecting the relatedness between any two classes, and wherein the branches of the tree structure reflect 
the difference vectors and the branches are ramified from nodes. 

34. The representation described in claim 31 wherein the extent of relatedness is obtained by a 
process comprising the steps of: 

a) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first 
subsequence and a second subsequence; 

b) in the nucleic acid of each class of cells determining the presence of a fragment with the first 
subsequence at one end and the second subsequence at another end and having a length separated by the 
first and second subsequences, and a quantitation of the extent to which each fragment is present; and 

c) determining the extent of relatedness reflecting similarities or differences in the presence and 
quantitation of the fragments among the classes. 

35. The representation described in claim 34 wherein the determining of the presence and 
quantitation of the fragments described in step b) is carried out by a process comprising the steps of: 

i) digesting samples of the nucleic acid from the cells of each class with a plurality of specific 
pairs of restriction endonucleases, each sample being treated by one pair, one nuclease of the pair 
targeting the first subsequence and the second nuclease of the pair targeting the second subsequence, 
each digestion providing specific restriction fragments, hybridizing double stranded adapter DNA 
molecules to the fragments, each adapter DNA molecule comprising (a) a shorter strand having no 5' 
terminal phosphate and consisting of a first and second portion, said first portion being at the 5' end and 
being complementary to the overhang produced by one of the restriction endonucleases of the pair, and 
(b) a longer strand having a 3' end complementary to the second portion of the shorter strand, and 
ligating the longer strands to the fragments to produce ligated fragments, wherein each ligated fragment 
is capable of generating an output signal; 

ii) generating output signals from each ligated fragment for each of the pairs of restriction 
endonucleases, each output signal characterizing (a) the subsequences of the pairs of restriction 
endonucleases (b) the length between the two subsequences corresponding to the two restriction 
endonucleases employed in each pair of nucleases, and (c) the quantitation of the fragment 
corresponding to the pair and the length; and 

iii) optionally searching a nucleotide sequence database to determine sequences that are 
predicted to produce or the absence of any sequences that are predicted to produce the one or more 
output signals produced by the nucleic acid from the cells of each class, the database comprising a 
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a 
sequence from the database being predicted to produce the one or more output signals when the sequence 
from the database has both (a) the same length between occurrences of target nucleotide subsequences as 
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are 
represented by said one or more output signals, or target nucleotide subsequences that are members of 
the same sets of target nucleotide subsequences represented by the one or more output signals, 

thereby providing a quantitative measure of the extent to which the nucleic acid present in the 
cells in each class contains fragments having the specific subsequence pairs and the nucleotide length 
between the pairs. 

36. The representation described in claim 34 wherein the determining of the presence of the 
fragments and the quantitation of the fragments, described in step b) is carried out by a process 
comprising the steps of: 

i) for each pair of nucleotide subsequences providing a pair of oligonucleotide primers, 
consisting of a first primer and a second primer, wherein the first primer is complementary to the first 
subsequence and the second primer is complementary to the second subsequence; 

ii) amplifying the nucleotide sequence between the first subsequence and the second 
subsequence using the oligonucleotide primers to prime the amplification, providing an amplicon 
characterized by the subsequence pair, a length between the two subsequences corresponding to 
the two primers employed in each pair and a quantitation of the extent to which each amplicon is 
present; and 

iii) generating output signals for each amplicon, each output signal characterizing (a) the 
subsequences of the pairs of primers, (b) the length, and (c) the quantitation; and 

iv) optionally searching a nucleotide sequence database to determine sequences that are 
predicted to produce or the absence of any sequences that are predicted to produce the one or more 
output signals produced by the nucleic acid from the cells of each class, the database comprising a 
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a 
sequence from the database being predicted to produce the one or more output signals when the sequence 
from the database has both (a) the same length between occurrences of target nucleotide subsequences as 
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are 
represented by said one or more output signals, or target nucleotide subsequences that are members of 
the same sets of target nucleotide subsequences represented by the one or more output signals, 

thereby providing a quantitative measure of the extent to which the nucleic acid present in the 
cells in each class contains the specific subsequence pairs and the nucleotide length between the pairs. 

37. The representation described in claim 31 wherein the cells in at least one class are cancer 
cells. 

38. The representation described in claim 31 wherein the cells in a class have been contacted 
with a putative pharmaceutical agent. 



38. A representation of the correlation between a plurality of classes of cells, wherein the cells in 
each class are chosen from the group consisting of cells of a given cell type, cells from a given tissue, 
and cells from a given organ, the correlation reflecting, in the nucleic acids of the classes of cells, 
differences in the presence of a pair of nucleotide subsequences, each pair consisting of a first 
subsequence and a second subsequence and the nucleotide length separating the first and second 
subsequences of the pair, and a quantitation of the extent to which each pair having the determined 
length is present in the cells, between the classes. 

39. The representation described in claim 38 wherein the correlation is related to a set of 
orthonormal eigenvectors, the elements of the basis set upon which the eigenvectors are constructed 
reflecting particular biochemical or physiological pathways correlated between the cells of the two 
classes, each eigenvector having an eigenvalue that is an integer greater than zero, the coefficients of the 
basis set elements in each eigenvector whose eigenvalue is less than a particular integer that is chosen to 
be an upper limit of the eigenvalues reflecting the contribution of the corresponding pathway to the 
biochemical or physiological differences correlated between the cells of the first class and the cells of the 
second class. 

40. The representation described in claim 38 wherein the representation is a cluster diagram or a 
dendrogram and includes a tree structure reflecting the relatedness of the pathways involved in the 
biochemical or physiological differences between cells of the two classes, wherein a correlation matrix 
provides a distance determination wherein the distance reflects the amplitude of a difference vector that 
is a difference between two vectors each of which reflects information obtained from one of the classes, 
and wherein the branches of the tree structure reflect the difference vectors and the branches are 
ramified from nodes. 

41. The representation described in claim 38 wherein the correlation is obtained by a method 
comprising the steps of: 

a) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first 
subsequence and a second subsequence; 

b) in the nucleic acid of each class of cells determining the presence of a fragment with the first 
subsequence at one end and the second subsequence at another end and having a length separated by the 
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby 
defining a difference between classes; 

c) evaluating the correlation between the cells of one class and the cells of a second class based 
on the difference between them; and 

d) preparing a representation of the correlation. 

42. The representation described in claim 41 wherein the determining of the presence and 
quantitation of the fragments described in step b) is carried out by a process comprising the steps of: 

i) digesting samples of the nucleic acid from the cells of each class with a plurality of specific 
pairs of restriction endonucleases, each sample being treated by one pair, one nuclease of the pair 
targeting the first subsequence and the second nuclease of the pair targeting the second subsequence, 
each digestion providing specific restriction fragments, hybridizing double stranded adapter DNA 
molecules to the fragments, each adapter DNA molecule comprising (a) a shorter strand having no 5' 
terminal phosphate and consisting of a first and second portion, said first portion being at the 5' end and 
being complementary to the overhang produced by one of the restriction endonucleases of the pair, and 
(b) a longer strand having a 3' end complementary to the second portion of the shorter strand, and 
ligating the longer strands to the fragments to produce ligated fragments, wherein each ligated fragment 
is capable of generating an output signal; 

ii) generating output signals from each ligated fragment for each of the pairs of restriction 
endonucleases, each output signal characterizing (a) the subsequences of the pairs of restriction 
endonucleases (b) the length between the two subsequences corresponding to the two restriction 
endonucleases employed in each pair of nucleases, and (c) the quantitation of the fragment 
corresponding to the pair and the length; and 

iii) optionally searching a nucleotide sequence database to determine sequences that are 
predicted to produce or the absence of any sequences that are predicted to produce the one or more 
output signals produced by the nucleic acid from the cells of each class, the database comprising a 
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a 
sequence from the database being predicted to produce the one or more output signals when the sequence 
from the database has both (a) the same length between occurrences of target nucleotide subsequences as 
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are 
represented by said one or more output signals, or target nucleotide subsequences that are members of 
the same sets of target nucleotide subsequences represented by the one or more output signals, 

thereby providing a quantitative measure of the extent to which the nucleic acid present in the 
cells in each class contains fragments having the specific subsequence pairs and the nucleotide length 
between the pairs. 

43. The representation described in claim 41 wherein the determining of the presence of the 
fragments and the quantitation of the fragments, described in step c) is carried out by a process 
comprising the steps of: 

i) for each pair of nucleotide subsequences providing a pair of oligonucleotide primers, 
consisting of a first primer and a second primer, wherein the first primer is complementary to the first 
subsequence and the second primer is complementary to the second subsequence; 

ii) amplifying the nucleotide sequence between the first subsequence and the second 
subsequence using the oligonucleotide primers to prime the amplification, providing an amplicon 
characterized by the subsequence pair, a length between the two subsequences corresponding to the two 
primers employed in each pair and a quantitation of the extent to which each amplicon is present; and 

iii) generating output signals for each amplicon, each output signal characterizing (a) the 
subsequences of the pairs of primers, (b) the length, and (c) the quantitation; and 

iv) optionally searching a nucleotide sequence database to determine sequences that are 
predicted to produce or the absence of any sequences that are predicted to produce the one or more 
output signals produced by the nucleic acid from the cells of each class, the database comprising a 
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a 
sequence from the database being predicted to produce the one or more output signals when the sequence 
from the database has both (a) the same length between occurrences of target nucleotide subsequences as 
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are 
represented by said one or more output signals, or target nucleotide subsequences that are members of 
the same sets of target nucleotide subsequences represented by the one or more output signals, 

thereby providing a quantitative measure of the extent to which the nucleic acid present in the 
cells in each class contains the specific subsequence pairs and the nucleotide length between the pairs. 

44. The representation described in claim 38 wherein the cells in at least one class are cancer 
cells. 

45. The representation described in claim 38 wherein the cells in at least one class have been 
contacted with a putative pharmaceutical agent, and the correlation is obtained by a method comprising 
the steps of: 

a) contacting the cells of at least one class with an amount of the agent sufficient to effect a 
change in the state of those cells or with an amount of the agent less than or equal to a predetermined 
upper limit of dosing concentration; 

b) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first 
subsequence and a second subsequence; 

c) in the nucleic acid of each class of cells determining the presence of a fragment with the first 
subsequence at one end and the second subsequence at another end and having a length separated by the 
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby 
defining an effect of the agent; 

d) evaluating the correlation between the effect of the agent between the cells of at least one 
class contacted with the agent and the cells of another class; and 

e) preparing a representation of the correlation. 

46. A method for generating a geometrical representation between a plurality of classes of cells 
wherein the cells in each class are chosen from the group consisting of cells of a given cell type, cells 
from a given tissue, and cells from a given organ, the representation reflecting a change in the nature and 
amount of nucleic acids present in the classes, the method comprising the steps of: 

a) in the nucleic acid of each class of cells, assessing the presence and amount of a nucleic acid 
fragment thereby defining a difference between the classes; 

b) carrying out a geometrical analysis based on the differences between the cells of the classes; 
and 

c) preparing a representation of the results of the analysis. 

47. The method described in claim 46 wherein the geometrical representation is a result obtained 
by a principal component analysis or a principal factor analysis. 
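
A principal component analysis of the kind named in claim 47 can be sketched in a few lines. The version below finds only the first principal component, using power iteration on the sample covariance matrix. The function name, the starting vector, and the iteration count are assumptions chosen for illustration, and a production analysis would normally rely on a numerical library instead.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (axis, scores) for the leading principal component.

    rows: one quantitation vector per class of cells, all of equal length.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix of the fragment quantitations.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(m)]
           for i in range(m)]
    # Asymmetric start vector; power iteration can still stall in the rare
    # case where the start is exactly orthogonal to the leading axis.
    v = [1.0 / (j + 1) for j in range(m)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance in the data
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(m)) for c in centered]
    return v, scores


# Hypothetical data: four classes measured on two fragments.
rows = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
axis, scores = first_principal_component(rows)
```

The scores give the position of each class along the leading axis, which is the kind of low-dimensional geometrical representation the claim refers to.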

48. The method described in claim 46 wherein assessing the presence and amount of a nucleic 
acid fragment described in step a) is carried out by a process comprising the steps of: 

i) probing the nucleic acid of each class with a set of oligonucleotide probes specific for the 
fragment; and 

ii) determining the extent to which each probe binds the nucleic acid; 

thereby providing an assessment of the presence and amount of the nucleic acid fragment in the 
class. 

49. The method described in claim 46 wherein assessing the presence and amount of a nucleic 
acid fragment described in step a) is carried out by a process comprising the steps of: 

i) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first 
subsequence and a second subsequence; and 

ii) in the nucleic acid of each class of cells determining the presence of a fragment with the first 
subsequence at one end and the second subsequence at another end and having a length separated by the 
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby 
defining the difference between the classes. 

50. The method described in claim 49 wherein assessing the presence and quantity of a nucleic 
acid fragment described in step ii) is carried out by a process comprising the steps of: 

(a) digesting samples of the nucleic acid from the cells of each class with a plurality of specific 
pairs of restriction endonucleases, each sample being treated by one pair, one nuclease of the pair 
targeting the first subsequence and the second nuclease of the pair targeting the second subsequence, 
each digestion providing specific restriction fragments, hybridizing double stranded adapter DNA 
molecules to the fragments, each adapter DNA molecule comprising (1) a shorter strand having no 5' 
terminal phosphate and consisting of a first and second portion, said first portion being at the 5' end and 
being complementary to the overhang produced by one of the restriction endonucleases of the pair, and 
(2) a longer strand having a 3' end complementary to the second portion of the shorter strand, and 
ligating the longer strands to the fragments to produce ligated fragments, wherein each ligated fragment 
is capable of generating an output signal; 

(b) generating output signals from each ligated fragment for each of the pairs of restriction 
endonucleases, each output signal characterizing (1) the subsequences of the pairs of restriction 
endonucleases (2) the length between the two subsequences corresponding to the two restriction 
endonucleases employed in each pair of nucleases, and (3) the quantitation of the fragment 
corresponding to the pair and the length; and 

(c) optionallv searching a nucleotide sequence database to determine sequences that are 
predicted to produce or the absence of an> sequences that are predicted to produce the one or more 
output signals produced by the nucleic acid tVom the celts of each class, the database comprisitm a 
plurality of know n nucleotide sequences of nucleic acids that nia\ he present in the cells of each class, a 
sequence from the database being predicted to produce the one or more output signals when the sequence 
from the database has both ( 1 ) the same length between occurrences of target nucleotide subsequences as 
is represented by the one or more output signals, and (2) the same target nucleotide subsequence as are 
represented b> said one or more output signals, or tarLiel nucleotide subsequences that are members of 
the same stets of target nucleotide subsequences rcprocntcLl h\ ifie one or more output siunals. 
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thereby providing a quantitative measure of the extent to uhieh the nucleic acid present in the 
cells in each class contains fragments liaving the specific subsequence pairs and the nucleotide length 
between the pairs. 
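
The optional database search of step (c) can be illustrated with a minimal sketch. It is not the claimed procedure itself: it assumes, for simplicity, that fragment length is counted from the start of one recognition subsequence to the end of the adjacent one, and it ignores enzyme cut offsets and adapter lengths. The function names, the example sequences, and the `tolerance` parameter are hypothetical.

```python
def find_sites(seq, site):
    """Return the start index of every occurrence of site in seq."""
    hits, i = [], seq.find(site)
    while i != -1:
        hits.append(i)
        i = seq.find(site, i + 1)
    return hits


def predicted_fragments(seq, first, second):
    """Lengths of fragments bounded by adjacent sites of the two different subsequences.

    Assumption: length runs from the start of one site to the end of the next.
    """
    sites = sorted([(p, first) for p in find_sites(seq, first)] +
                   [(p, second) for p in find_sites(seq, second)])
    lengths = []
    for (p1, s1), (p2, s2) in zip(sites, sites[1:]):
        if s1 != s2:  # one end carries each subsequence of the pair
            lengths.append(p2 + len(s2) - p1)
    return lengths


def matches_signal(database, first, second, observed_length, tolerance=0):
    """Names of database sequences predicted to produce the observed signal."""
    return [name for name, seq in database.items()
            if any(abs(length - observed_length) <= tolerance
                   for length in predicted_fragments(seq, first, second))]


# Hypothetical two-entry database; GAATTC and GGATCC stand in for the subsequence pair.
database = {
    "geneA": "AAGAATTCTTTTGGATCCAA",
    "geneB": "GAATTCGAATTC",
}
hits = matches_signal(database, "GAATTC", "GGATCC", observed_length=16)
```

In this toy database only `geneA` contains both subsequences at the observed spacing, so it is the only candidate returned; an empty list would correspond to the "absence of any sequences" case in step (c).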

51. The method described in claim 49 wherein assessing the presence and quantity of a nucleic 
acid fragment described in step ii) is carried out by a process comprising the steps of: 

(a) for each pair of nucleotide subsequences providing a pair of oligonucleotide primers, 
consisting of a first primer and a second primer, wherein the first primer is complementary to the first 
subsequence and the second primer is complementary to the second subsequence; 

(b) amplifying the nucleotide sequence between the first subsequence and the second 
subsequence using the oligonucleotide primers to prime the amplification, providing an amplicon 
characterized by the subsequence pair, a length between the two subsequences corresponding to the two 
primers employed in each pair and a quantitation of the extent to which each amplicon is present; and 

(c) generating output signals for each amplicon, each output signal characterizing (1) the 
subsequences of the pairs of primers, (2) the length, and (3) the quantitation; and 

(d) optionally searching a nucleotide sequence database to determine sequences that are 
predicted to produce or the absence of any sequences that are predicted to produce the one or more 
output signals produced by the nucleic acid from the cells of each class, the database comprising a 
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a 
sequence from the database being predicted to produce the one or more output signals when the sequence 
from the database has both (1) the same length between occurrences of target nucleotide subsequences as 
is represented by the one or more output signals, and (2) the same target nucleotide subsequence as are 
represented by said one or more output signals, or target nucleotide subsequences that are members of 
the same sets of target nucleotide subsequences represented by the one or more output signals, 

thereby providing a quantitative measure of the extent to which the nucleic acid present in the 
cells in each class contains the specific subsequence pairs and the nucleotide length between the pairs. 

52. The method described in claim 46 wherein the results of the geometrical analysis are chosen 
from the group consisting of eigenvalues, eigenvectors, and principal factors. 

53. The method described in claim 46 wherein the results of analysis in step c) are related to a set 
of orthonormal eigenvectors, the elements of the basis set upon which the eigenvectors are constructed 
reflecting particular biochemical, physiological or pharmacological components correlated between the 
cells of the two classes, each eigenvector having an eigenvalue, the coefficients of the basis set elements 
in each eigenvector reflecting the contribution of the corresponding biochemical, physiological or 
pharmacological components to the differences between the cells of the first class and the cells of the 
second class. 
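
The eigenvalues and eigenvectors named in claims 52 and 53 can be illustrated with a short sketch. It uses power iteration, a generic numerical technique chosen here only for brevity and not stated in the claims, to approximate the largest eigenvalue of a small correlation matrix and its unit eigenvector. The function name and the 2 x 2 example matrix are hypothetical.

```python
import math


def leading_eigenpair(matrix, iterations=200):
    """Approximate the largest eigenvalue and its unit eigenvector by power iteration."""
    n = len(matrix)
    # Arbitrary non-uniform start; the method fails only if this start is
    # exactly orthogonal to the leading eigenvector or the matrix is all zeros.
    v = [float(i + 1) for i in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The Rayleigh quotient v.Mv gives the eigenvalue for the unit vector v.
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v[i] * mv[i] for i in range(n))
    return eigenvalue, v


# Hypothetical correlation matrix between two classes (correlation 0.5).
correlation = [[1.0, 0.5],
               [0.5, 1.0]]
value, vector = leading_eigenpair(correlation)
```

For this matrix the exact leading eigenvalue is 1.5 and the eigenvector has equal coefficients, so both basis elements contribute equally to the first component. A full analysis would extract all eigenpairs, for example with a library eigensolver.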

54. The method described in claim 46 wherein the cells in at least one class are cancer cells. 

55. The method described in claim 46 wherein the cells in at least one class are contacted with a 
putative pharmaceutical agent, and the method comprises the steps of: 

a) treating the cells of at least one class with an amount of the agent sufficient to effect a change 
in the state of those cells or with an amount of the agent less than or equal to a predetermined upper limit 
of dosing concentration; 

b) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first 
subsequence and a second subsequence; 

c) in the nucleic acid of each class of cells determining the presence of a fragment with the first 
subsequence at one end and the second subsequence at another end and having a length separated by the 
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby 
defining an effect of the agent; 

d) conducting a principal component analysis between the effect of the agent on the cells of the 
first class and the cells of another class; and 

e) preparing a representation of the results of the analysis. 

56. A display means displaying a geometrical representation between a plurality of classes of 
cells wherein the cells in each class are chosen from the group consisting of cells of a given cell type, 
cells from a given tissue, and cells from a given organ, the principal component analysis reflecting a 
change in the nature and amount of nucleic acids present in the classes, wherein the representation is 
obtained by a method comprising the steps of: 

a) in the nucleic acid of each class of cells, assessing the presence and amount of a nucleic acid 
fragment thereby defining a difference between the classes; 

b) carrying out a principal component analysis based on the differences between the cells of the 
first class and the cells of the second class; and 

c) preparing the representation of the results of the analysis. 

57. The display means described in claim 56 wherein the geometrical representation is a result 
obtained by a principal component analysis or a principal factor analysis. 

58. The display means described in claim 56 wherein assessing the presence and amount of a 
nucleic acid fragment described in step a) comprises the steps of: 

i) probing the nucleic acid of each class with a set of oligonucleotide probes specific for the 
fragment; and 

ii) determining the extent to which each probe binds the nucleic acid; 

thereby providing an assessment of the presence and amount of the nucleic acid fragment in the 
class. 

59. The display means described in claim 56 wherein assessing the presence and amount of a 
nucleic acid fragment described in step a) is carried out by a process comprising the steps of: 

i) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first 
subsequence and a second subsequence; and 

ii) in the nucleic acid of each class of cells determining the presence of a fragment with the first 
subsequence at one end and the second subsequence at another end and having a length separated by the 
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby 
defining the difference between the classes. 

60. The display means described in claim 59 wherein determining the presence and quantity of a 
nucleic acid fragment described in step ii) is carried out by a process comprising the steps of: 

(a) digesting samples of the nucleic acid from the cells of each class with a plurality of specific 
pairs of restriction endonucleases, each sample being treated by one pair, one nuclease of the pair 
targeting the first subsequence and the second nuclease of the pair targeting the second subsequence, 
each digestion providing specific restriction fragments, hybridizing double stranded adapter DNA 
molecules to the fragments, each adapter DNA molecule comprising (1) a shorter strand having no 5' 
terminal phosphate and consisting of a first and second portion, said first portion being at the 5' end and 
being complementary to the overhang produced by one of the restriction endonucleases of the pair, and 
(2) a longer strand having a 3' end complementary to the second portion of the shorter strand, and 
ligating the longer strands to the fragments to produce ligated fragments, wherein each ligated fragment 
is capable of generating an output signal; 

(b) generating output signals from each ligated fragment for each of the pairs of restriction 
endonucleases, each output signal characterizing (1) the subsequences of the pairs of restriction 
endonucleases, (2) the length between the two subsequences corresponding to the two restriction 
endonucleases employed in each pair of nucleases, and (3) the quantitation of the fragment 
corresponding to the pair and the length; and 

(c) optionally searching a nucleotide sequence database to determine sequences that are 
predicted to produce or the absence of any sequences that are predicted to produce the one or more 
output signals produced by the nucleic acid from the cells of each class, the database comprising a 
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a 
sequence from the database being predicted to produce the one or more output signals when the sequence 
from the database has both (1) the same length between occurrences of target nucleotide subsequences as 
is represented by the one or more output signals, and (2) the same target nucleotide subsequence as are 
represented by said one or more output signals, or target nucleotide subsequences that are members of 
the same sets of target nucleotide subsequences represented by the one or more output signals, 

thereby providing a quantitative measure of the extent to which the nucleic acid present in the 
cells in each class contains fragments having the specific subsequence pairs and the nucleotide length 
between the pairs. 

61. The display means described in claim 59 wherein assessing the presence and quantity of a 
nucleic acid fragment described in step ii) is carried out by a process comprising the steps of: 

(a) for each pair of nucleotide subsequences providing a pair of oligonucleotide primers, 
consisting of a first primer and a second primer, wherein the first primer is complementary to the first 
subsequence and the second primer is complementary to the second subsequence; 

(b) amplifying the nucleotide sequence between the first subsequence and the second 
subsequence using the oligonucleotide primers to prime the amplification, providing an amplicon 
characterized by the subsequence pair, a length between the two subsequences corresponding to the two 
primers employed in each pair and a quantitation of the extent to which each amplicon is present; and 

(c) generating output signals for each amplicon, each output signal characterizing (1) the 
subsequences of the pairs of primers, (2) the length, and (3) the quantitation; and 

(d) optionally searching a nucleotide sequence database to determine sequences that are 
predicted to produce or the absence of any sequences that are predicted to produce the one or more 
output signals produced by the nucleic acid from the cells of each class, the database comprising a 
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a 
sequence from the database being predicted to produce the one or more output signals when the sequence 
from the database has both (1) the same length between occurrences of target nucleotide subsequences as 
is represented by the one or more output signals, and (2) the same target nucleotide subsequence as are 
represented by said one or more output signals, or target nucleotide subsequences that are members of 
the same sets of target nucleotide subsequences represented by the one or more output signals, 

thereby providing a quantitative measure of the extent to which the nucleic acid present in the 
cells in each class contains the specific subsequence pairs and the nucleotide length between the pairs. 

62. The display means described in claim 56 wherein the results of the analysis are chosen from 
the group consisting of eigenvalues, eigenvectors, and principal factors. 

63. The display means described in claim 56 wherein the results of the analysis in step c) are 
related to a set of orthonormal eigenvectors, the elements of the basis set upon which the eigenvectors 
are constructed reflecting particular biochemical, physiological or pharmacological components 
correlated between the cells of the two classes, each eigenvector having an eigenvalue, the coefficients of 
the basis set elements in each eigenvector reflecting the contribution of the corresponding biochemical, 
physiological or pharmacological components to the differences between the cells of the first class and 
the cells of the second class. 

64. The display means described in claim 56 wherein the cells in at least one class are cancer 
cells. 

65. The display means described in claim 56 wherein the cells in at least one class have been 
contacted with a putative pharmaceutical agent, and the representation is obtained by a method 
comprising the steps of: 

a) treating the cells of at least one class with an amount of the agent sufficient to effect a change 
in the state of those cells or with an amount of the agent less than or equal to a predetermined upper limit 
of dosing concentration; 

b) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first 
subsequence and a second subsequence; 

c) in the nucleic acid of each class of cells determining the presence of a fragment with the first 
subsequence at one end and the second subsequence at another end and having a length separated by the 
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby 
defining an effect of the agent; 

d) conducting a principal component analysis between the effect of the agent on the cells of the 
first class and the cells of another class; and 

e) preparing the representation of the results of the analysis. 

66. A geometrical representation between a plurality of classes of cells wherein the cells in each 
class are chosen from the group consisting of cells of a given cell type, cells from a given tissue, and 
cells from a given organ, the principal component analysis reflecting a change in the nature and amount 
of nucleic acids present in the classes, the representation obtained by a method comprising the steps of: 

a) in the nucleic acid of each class of cells, assessing the presence and amount of a nucleic acid 
fragment thereby defining a difference between the classes; 

b) carrying out a principal component analysis based on the differences between the cells of the 
first class and the cells of the second class; and 

c) preparing the representation of the results of the analysis. 

67. The representation described in claim 66 wherein the geometrical representation is a result 
obtained by a principal component analysis or a principal factor analysis. 

68. The representation described in claim 66 wherein assessing the presence and amount of a 
nucleic acid fragment described in step a) comprises the steps of: 

i) probing the nucleic acid of each class with a set of oligonucleotide probes specific for the 
fragment; and 

ii) determining the extent to which each probe binds the nucleic acid; 

thereby providing an assessment of the presence and amount of the nucleic acid fragment in the 
class. 

69. The representation described in claim 66 wherein assessing the presence and amount of a 
nucleic acid fragment described in step a) is carried out by a process comprising the steps of: 

i) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first 
subsequence and a second subsequence; and 

ii) in the nucleic acid of each class of cells determining the presence of a fragment with the first 
subsequence at one end and the second subsequence at another end and having a length separated by the 
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby 
defining the difference between the classes. 

70. The representation described in claim 69 wherein determining the presence and quantity of a 
nucleic acid fragment described in step ii) is carried out by a process comprising the steps of: 

(a) digesting samples of the nucleic acid from the cells of each class with a plurality of specific 
pairs of restriction endonucleases, each sample being treated by one pair, one nuclease of the pair 
targeting the first subsequence and the second nuclease of the pair targeting the second subsequence, 
each digestion providing specific restriction fragments, hybridizing double stranded adapter DNA 
molecules to the fragments, each adapter DNA molecule comprising (1) a shorter strand having no 5' 
terminal phosphate and consisting of a first and second portion, said first portion being at the 5' end and 
being complementary to the overhang produced by one of the restriction endonucleases of the pair, and 
(2) a longer strand having a 3' end complementary to the second portion of the shorter strand, and 
ligating the longer strands to the fragments to produce ligated fragments, wherein each ligated fragment 
is capable of generating an output signal; 

(b) generating output signals from each ligated fragment for each of the pairs of restriction 
endonucleases, each output signal characterizing (1) the subsequences of the pairs of restriction 
endonucleases, (2) the length between the two subsequences corresponding to the two restriction 
endonucleases employed in each pair of nucleases, and (3) the quantitation of the fragment 
corresponding to the pair and the length; and 

(c) optionally searching a nucleotide sequence database to determine sequences that are 
predicted to produce or the absence of any sequences that are predicted to produce the one or more 
output signals produced by the nucleic acid from the cells of each class, the database comprising a 
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a 
sequence from the database being predicted to produce the one or more output signals when the sequence 
from the database has both (1) the same length between occurrences of target nucleotide subsequences as 
is represented by the one or more output signals, and (2) the same target nucleotide subsequence as are 
represented by said one or more output signals, or target nucleotide subsequences that are members of 
the same sets of target nucleotide subsequences represented by the one or more output signals, 

thereby providing a quantitative measure of the extent to which the nucleic acid present in the 
cells in each class contains fragments having the specific subsequence pairs and the nucleotide length 
between the pairs. 

71. The representation described in claim 69 wherein assessing the presence and quantity of a 
nucleic acid fragment described in step ii) is carried out by a process comprising the steps of: 

(a) for each pair of nucleotide subsequences providing a pair of oligonucleotide primers, 
consisting of a first primer and a second primer, wherein the first primer is complementary to the first 
subsequence and the second primer is complementary to the second subsequence; 

(b) amplifying the nucleotide sequence between the first subsequence and the second 
subsequence using the oligonucleotide primers to prime the amplification, providing an amplicon 
characterized by the subsequence pair, a length between the two subsequences corresponding to the two 
primers employed in each pair and a quantitation of the extent to which each amplicon is present; and 

(c) generating output signals for each amplicon, each output signal characterizing (1) the 
subsequences of the pairs of primers, (2) the length, and (3) the quantitation; and 

(d) optionally searching a nucleotide sequence database to determine sequences that are 
predicted to produce or the absence of any sequences that are predicted to produce the one or more 
output signals produced by the nucleic acid from the cells of each class, the database comprising a 
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a 
sequence from the database being predicted to produce the one or more output signals when the sequence 
from the database has both (1) the same length between occurrences of target nucleotide subsequences as 
is represented by the one or more output signals, and (2) the same target nucleotide subsequence as are 
represented by said one or more output signals, or target nucleotide subsequences that are members of 
the same sets of target nucleotide subsequences represented by the one or more output signals, 

thereby providing a quantitative measure of the extent to which the nucleic acid present in the 
cells in each class contains the specific subsequence pairs and the nucleotide length between the pairs. 

72. The representation described in claim 66 wherein the results of the analysis are chosen from 
the group consisting of eigenvalues, eigenvectors, and principal factors. 

73. The representation described in claim 66 wherein the results of the analysis in step c) are 
related to a set of orthonormal eigenvectors, the elements of the basis set upon which the eigenvectors 
are constructed reflecting particular biochemical, physiological or pharmacological components 
correlated between the cells of the two classes, each eigenvector having an eigenvalue, the coefficients of 
the basis set elements in each eigenvector reflecting the contribution of the corresponding biochemical, 
physiological or pharmacological components to the differences between the cells of the first class and 
the cells of the second class. 

74. The representation described in claim 66 wherein the cells in at least one class are cancer 
cells. 

75. The representation described in claim 66 wherein the cells in at least one class have been 
contacted with a putative pharmaceutical agent, and the representation is obtained by a method 
comprising the steps of: 

a) treating the cells of at least one class with an amount of the agent sufficient to effect a change 
in the state of those cells or with an amount of the agent less than or equal to a predetermined upper limit 
of dosing concentration; 

b) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first 
subsequence and a second subsequence; 

c) in the nucleic acid of each class of cells determining the presence of a fragment with the first 
subsequence at one end and the second subsequence at another end and having a length separated by the 
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby 
defining an effect of the agent; 

d) conducting a principal component analysis between the effect of the agent on the cells of the 
first class and the cells of another class; and 

e) preparing the representation of the results of the analysis. 

76. A method for classifying a plurality of classes of cells or components thereof hierarchically 
comprising the steps of: 

a) measuring relative differences in the quantity of a nucleic acid present in each class of cells to 
provide measurements of differential nucleic acid display; 

b) converting the measurements into distances between the classes of cells in a vector space; and 

c) preparing a hierarchical classification amongst the classes based on the vector distances. 
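
Steps b) and c) can be sketched as follows. This is an illustrative, unoptimized agglomerative clustering under stated assumptions: each class is represented by a vector of measurements, the distance is Euclidean, and the linkage rule is selectable. The function names and the three example profiles are hypothetical and are not data from this document.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two measurement vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hierarchical_classification(profiles, linkage="average"):
    """Agglomerative clustering of named measurement vectors.

    Returns the merges in order as (cluster_a, cluster_b, distance),
    where each cluster is a tuple of class names.
    """
    combine = {"single": min, "complete": max,
               "average": lambda d: sum(d) / len(d)}[linkage]
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairwise = [euclidean(profiles[a], profiles[b])
                            for a in clusters[i] for b in clusters[j]]
                d = combine(pairwise)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical measurement vectors for three classes of cells.
profiles = {"control": [0.0, 0.0], "drugA": [0.0, 1.0], "drugB": [5.0, 5.0]}
merges = hierarchical_classification(profiles, linkage="single")
```

With these toy values the two closest classes, `control` and `drugA`, merge first, and `drugB` joins last. The ordered list of merges is one possible encoding of the hierarchy that a tree diagram would display.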

77. The method of claim 76 wherein the classification is performed on classes of cells, wherein 
the cells in a class may be cells of a given cell type, cells from a given tissue, and cells from a given 
organ, cells exhibiting a particular pathological state, or cells which have been contacted with a putative 
pharmaceutical agent. 

78. The method of claim 76 wherein the classification is performed on a component of the cells 
in the classes, wherein the component comprises a gene, a nucleic acid, or a fragment thereof. 

79. The method of claim 76 wherein the measuring is carried out by a procedure chosen from 
the group consisting of differential display of nucleic acid fragments, probing for the presence of a 
nucleic acid using an oligonucleotide probe, sequences obtained from expressed sequence tags (ESTs), 
assessing restriction fragment length polymorphisms, and assessing amplification fragment length 
polymorphisms. 

80. The method of claim 76 wherein the preparation of the hierarchical classification is carried 
out by a procedure chosen from the group consisting of principal component analysis of a correlation 
matrix, principal factor analysis of a correlation matrix, principal component analysis of a centered inner 
product matrix, and principal factor analysis of a centered inner product matrix. 

81. The method of claim 80 further comprising the step of obtaining a distance metric between 
the classes from a reduced dimensionality geometrical representation. 

82. A display means displaying the results of the classification obtained by a method described 
in any one of claims 76-81. 

83. A method for representing a plurality of classes of cells or components thereof geometrically 
comprising the steps of: 

a) measuring relative differences in the quantity of a nucleic acid present in each class of cells to 
provide measurements of differential nucleic acid display; and 

b) preparing a geometrical representation amongst the classes based on the measurement of the 
differential display. 

84. The method of claim 83 wherein the classification is performed on classes of cells, wherein 
the cells in a class may be cells of a given cell type, cells from a given tissue, and cells from a given 
organ, cells exhibiting a particular pathological state, or cells which have been contacted with a putative 
pharmaceutical agent. 

85. The method of claim 83 wherein the classification is performed on a component of the cells 
in the classes, wherein the component comprises a gene, a nucleic acid, or a fragment thereof. 

86. The method of claim 83 wherein the measuring is carried out by a procedure chosen from 
the group consisting of differential display of nucleic acid fragments, probing for the presence of a 
nucleic acid using an oligonucleotide probe, sequences obtained from expressed sequence tags (ESTs), 
assessing restriction fragment length polymorphisms, and assessing amplification fragment length 
polymorphisms. 

87. The method of claim 83 wherein the preparation of the hierarchical classification is carried 
out by a procedure chosen from the group consisting of principal component analysis of a correlation 
matrix, principal factor analysis of a correlation matrix, principal component analysis of a centered inner 
product matrix, and principal factor analysis of a centered inner product matrix. 

88. The method of claim 87 further comprising the step of obtaining a distance metric between 
the classes from a reduced dimensionality geometrical representation. 

89. A display means displaying the results of the geometrical representation obtained by a 
method described in any one of claims 83-88. 

90. A method of presenting the hierarchical relatedness of two or more members of a 
population, the method comprising: 

providing a data set of each member in the population; 

generating a hierarchical classification of said data set; and 

displaying said classification, thereby presenting the hierarchical relatedness of the members of 
the population. 

91. The method of claim 90, wherein said population is a population of cells. 

92. The method of claim 90, wherein said population is a population of nucleic acid sequences. 

93. The method of claim 90, wherein said population is a population of polypeptide sequences. 

94. The method of claim 90, wherein said hierarchical classification of any two or more 
members of the population is calculated using a distance method in combination with an algorithm. 

95. The method of claim 94, wherein said distance method is a Pearson correlation distance, 
Euclidean distance, Manhattan distance, Mahalanobis distance, a pairwise Pearson distance, or a 
Spearman distance. 

96. The method of claim 95, wherein said algorithm is single linkage, average linkage, or 
complete linkage. 
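
Several of the distance methods named in claim 95 can be sketched in a few lines. The definitions below are common textbook conventions and are assumptions, not definitions taken from this document: in particular, the Pearson and Spearman distances are taken here as one minus the corresponding correlation coefficient, and tied values receive average ranks. All function names are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def pearson_r(a, b):
    """Pearson correlation; undefined (division by zero) for a constant vector."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson_distance(a, b):
    return 1.0 - pearson_r(a, b)  # assumed convention: 1 - r


def spearman_distance(a, b):
    return 1.0 - pearson_r(ranks(a), ranks(b))  # Pearson applied to ranks
```

Any of these functions could supply the pairwise distances consumed by a single, average, or complete linkage procedure. The correlation-based distances respond to the shape of a profile rather than its magnitude, which is one reason to prefer them when classes differ in overall signal level.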

97. The method of claim 90, wherein said data set is the product of an analysis of said members 
of the population that is selected from the group consisting of differential display, serial analysis of gene 
expression, expression tagged sequence analysis, restriction fragment length polymorphism, amplified 
fragment length polymorphism, or Northern blot hybridization analysis. 

98. A method of presenting the geometrical relatedness of two or more members of a 
population, the method comprising: 

providing a data set of each member in the population; 

generating a geometrical classification of said data set; and 

displaying said classification, thereby presenting the geometrical relatedness of the members of 
the population. 

99. The method of claim 98, wherein said population is a population of cells. 

100. The method of claim 98, wherein said population is a population of nucleic acid sequences. 

101. The method of claim 98, wherein said population is a population of polypeptide sequences. 

102. The method of claim 98, wherein said geometrical classification is generated by analyzing 
a matrix using an algorithm. 

103. The method of claim 102, wherein said matrix includes a correlation matrix. 

104. The method of claim 103, wherein said correlation matrix includes a Pearson correlation 
matrix, a Spearman correlation matrix, or a pairwise Pearson correlation matrix. 

105. The method of claim 102, wherein said matrix includes a centered inner product distance 
matrix. 

106. The method of claim 105, wherein the inner product distance matrix is determined using a 
distance calculated by hierarchical classification analysis. 
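
One common way to obtain a centered inner product matrix from a matrix of pairwise distances is the double centering used in classical multidimensional scaling. The sketch below assumes that this is the construction intended, which the claims do not state, and that the distance matrix is symmetric. The function name and the three-point example are hypothetical.

```python
def centered_inner_product(distances):
    """Double-center the squared distances: B = -1/2 * J * D^2 * J.

    Assumes a symmetric distance matrix, so row means equal column means.
    """
    n = len(distances)
    squared = [[d * d for d in row] for row in distances]
    row_mean = [sum(row) / n for row in squared]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (squared[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)]
            for i in range(n)]


# Hypothetical distances between three classes lying on a line at 0, 1 and 3.
distances = [[0.0, 1.0, 3.0],
             [1.0, 0.0, 2.0],
             [3.0, 2.0, 0.0]]
inner = centered_inner_product(distances)
```

For points on a line the result equals the outer product of the mean-centered coordinates (here -4/3, -1/3 and 5/3), so every row sums to zero. The eigenvectors of this matrix, scaled by the square roots of their eigenvalues, would give coordinates for a geometrical representation.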

107. The method of claim 102, wherein said algorithm includes principal component analysis. 

108. The method of claim 102, wherein said algorithm includes principal factor analysis. 

109. The method of claim 107, wherein said algorithm includes principal factor analysis. 

110. The method of claim 102, wherein said geometrical classification is further analyzed using 
hierarchical classification. 

111. The method of claim 90, wherein said population includes 5, 10, 25, 50, 100, 1000, 10,000, 
100,000 or more members. 

112. The method of claim 98, wherein said population includes 5, 10, 25, 50, 100, 1000, 10,000, 
100,000 or more members. 



WO 00/15851    PCT/US99/21525 

1/4    SUBSTITUTE SHEET (RULE 26) 

[Fig. 1: flow chart. Boxes and arrow labels, correlation branch: Experiments; Measurement 
matrix (samples, difference bands); Measurement to correlation; Correlate samples; Correlate 
differences (can perform further analysis as with samples); Correlation to distance (Pearson 
metric); Principal components; Scale, rotate; Principal factors. Distance branch: Measurement 
to distance (Euclidean metric, Manhattan metric, ...); Samples distance; Single linkage, 
average linkage, complete linkage, ...; Samples hierarchical clustering.] 
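
The distance branch of Fig. 1 (measurement to distance, then linkage, then hierarchical clustering) can be sketched as follows. This is a naive agglomerative procedure written for clarity under stated assumptions, not the software of the specification; the sample names and coordinates are hypothetical, and "average" is taken to mean the mean of all pairwise distances between two clusters.

```python
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

# Each linkage reduces the list of pairwise distances between two clusters to one number.
LINKAGES = {
    "single": min,
    "complete": max,
    "average": lambda distances: sum(distances) / len(distances),
}

def hierarchical_clustering(samples, metric=euclidean, linkage="average"):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    combine = LINKAGES[linkage]
    clusters = [(name,) for name in samples]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                distances = [metric(samples[a], samples[b])
                             for a in clusters[i] for b in clusters[j]]
                d = combine(distances)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical two-dimensional measurements (illustrative values only).
samples = {
    "control_1": [0.0, 0.0],
    "control_2": [0.0, 1.0],
    "treated_1": [5.0, 0.0],
    "treated_2": [5.0, 1.0],
}
single_merges = hierarchical_clustering(samples, euclidean, "single")
complete_merges = hierarchical_clustering(samples, manhattan, "complete")
```

The merge history is the information a dendrogram such as Fig. 3 displays: each tuple records which two groups were joined and at what distance.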



[Fig. 2 (sheet 2/4): flow chart. Boxes: Correlation matrix; Centered inner product matrix; 
Principal components; Principal factors; Reduced-dimensionality representation; Noise-reduced 
distances from selected components or factors.] 
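
One standard way to obtain a centered inner product matrix from a distance matrix is the double-centering step of classical multidimensional scaling, B = -1/2 J D^2 J with J = I - (1/n) 1 1^T. The sketch below assumes that this is the construction intended by the box of that name in Fig. 2, which this section does not confirm; the three points are hypothetical.

```python
from math import sqrt

def centered_inner_product(distances):
    """Double-center the squared distances: B = -1/2 * J * D^2 * J."""
    n = len(distances)
    squared = [[d * d for d in row] for row in distances]
    row_mean = [sum(row) / n for row in squared]
    col_mean = [sum(squared[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (squared[i][j] - row_mean[i] - col_mean[j] + grand_mean)
             for j in range(n)]
            for i in range(n)]

def recovered_distance(inner, i, j):
    """Distance implied by the inner products: d_ij^2 = B_ii + B_jj - 2 * B_ij."""
    return sqrt(max(inner[i][i] + inner[j][j] - 2.0 * inner[i][j], 0.0))

# Hypothetical samples lying on a line at positions 0, 1 and 3 (illustrative only).
positions = [0.0, 1.0, 3.0]
distances = [[abs(a - b) for b in positions] for a in positions]
inner = centered_inner_product(distances)
```

Each row of `inner` sums to zero, which is what "centered" means here, and the original distances can be recovered from it. Keeping only the leading eigenvectors of such a matrix before recomputing distances is one way to arrive at a reduced-dimensionality representation.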



[Fig. 3 (sheet 3/4): tree diagram with scale bar 0.1. Leaf labels: vigabatrin, water 1, 
water 2, water 3, paraldehyde, gabapentin, phenobarbital, water 4.] 



[Fig. 4 (sheet 4/4): scatter plot. Axis label: Factor 1; axis ticks at -0.8, -0.4, 0.4 and 0.8. 
Legend: vigabatrin, phenobarbital, gabapentin, paraldehyde, control.] 



INTERNATIONAL SEARCH REPORT 

International Application No: PCT/US 99/21525 

A. CLASSIFICATION OF SUBJECT MATTER 

IPC 7 C12Q1/68 G06F17/30 

According to International Patent Classification (IPC) or to both national classification and IPC 

B. FIELDS SEARCHED 

Minimum documentation searched (classification system followed by classification symbols) 

IPC 7 C12Q G06F 

Documentation searched other than minimum documentation to the extent that such documents are 
included in the fields searched 

Electronic data base consulted during the international search (name of data base and, where 
practical, search terms used) 

C. DOCUMENTS CONSIDERED TO BE RELEVANT 

Category * / Citation of document, with indication, where appropriate, of the relevant 
passages / Relevant to claim No. 

WO 97 15690 A (CURAGEN CORP) 
1 May 1997 (1997-05-01) 
the whole document 
Relevant to claim No.: 1-112 

GUILFOYLE R A ET AL: "Ligation-mediated PCR amplification of specific fragments from 
class-II restriction endonuclease total digest" 
NUCLEIC ACIDS RESEARCH, XP002076198 
the whole document 
Relevant to claim No.: 1-112 

Further documents are listed in the continuation of box C. 

Patent family members are listed in annex. 

* Special categories of cited documents: 

"A" document defining the general state of the art which is not considered to be of particular 
relevance 

"E" earlier document but published on or after the international filing date 

"L" document which may throw doubts on priority claim(s) or which is cited to establish the 
publication date of another citation or other special reason (as specified) 

"O" document referring to an oral disclosure, use, exhibition or other means 

"P" document published prior to the international filing date but later than the priority date 
claimed 

"T" later document published after the international filing date or priority date and not in 
conflict with the application but cited to understand the principle or theory underlying the 
invention 

"X" document of particular relevance; the claimed invention cannot be considered novel or 
cannot be considered to involve an inventive step when the document is taken alone 

"Y" document of particular relevance; the claimed invention cannot be considered to involve an 
inventive step when the document is combined with one or more other such documents, such 
combination being obvious to a person skilled in the art 

"&" document member of the same patent family 

Date of the actual completion of the international search: 9 February 2000 

Date of mailing of the international search report: 28/02/2000 

Name and mailing address of the ISA: 
European Patent Office, P.B. 5818 Patentlaan 2 
NL - 2280 HV Rijswijk 
Tel. (+31-70) 340-2040, Tx. 31 651 epo nl, 
Fax: (+31-70) 340-3016 

Authorized officer: Müller, F 

Form PCT/ISA/210 (second sheet) (July 1992) 

page 1 of 2 



C. (Continuation) DOCUMENTS CONSIDERED TO BE RELEVANT 

Category * / Citation of document, with indication, where appropriate, of the relevant 
passages / Relevant to claim No. 

KATO K: "DESCRIPTION OF THE ENTIRE MRNA POPULATION BY A 3' END CDNA FRAGMENT GENERATED BY 
CLASS IIS RESTRICTION ENZYMES" 
NUCLEIC ACIDS RESEARCH, GB, OXFORD UNIVERSITY PRESS, SURREY, 
vol. 23, no. 18, 1 September 1995 (1995-09-01), pages 3685-3690, XP002008304 
ISSN: 0305-1048 
the whole document 

WO 97 22720 A (BEATTIE KENNETH LOREN) 
26 June 1997 (1997-06-26) 
the whole document 

WO 97 29211 A (US HEALTH; WEINSTEIN JOHN N (US); BOULAMWINI JOHN (US)) 
14 August 1997 (1997-08-14) 
the whole document 

WO 97 13877 A (LYNX THERAPEUTICS INC; MARTIN DAVID W (US)) 
17 April 1997 (1997-04-17) 
the whole document 

US 5 508 169 A (DEUGAU KENNETH V ET AL) 
16 April 1996 (1996-04-16) 
the whole document 

Relevant to claim No., as entered against the citations above: 1-112 

Category P,X 
SHIMKETS R.A. ET AL.: "Gene expression analysis by transcript profiling coupled to a gene 
database query" 
NATURE BIOTECHNOLOGY, vol. 17, August 1999 (1999-08), pages 798-803, XP002130008 
cited in the application 
the whole document 
Relevant to claim No.: 1-112 

Form PCT/ISA/210 (continuation of second sheet) (July 1992) 

page 2 of 2 



INTERNATIONAL SEARCH REPORT 
Information on patent family members 

Patent document            Publication    Patent family       Publication 
cited in search report     date           member(s)           date 

WO 9715690 A               01-05-1997     US 5871697 A        16-02-1999 
                                          US 5972693 A        26-10-1999 
                                          AU 7476396 A        15-05-1997 
                                          EP 0866877 A        30-09-1999 

WO 9722720 A               26-06-1997     AU 1687597 A        14-07-1997 

WO 9729211 A               14-08-1997     AU 2264197 A        28-08-1997 

WO 9713877 A               17-04-1997     AU 712929 B         18-11-1999 
                                          AU 4277896 A        06-05-1996 
                                          AU 6102096 A        30-12-1996 
                                          AU 7717596 A        30-04-1997 
                                          CN 1193357 A        16-09-1998 
                                          CZ 9700866 A        17-09-1997 
                                          CZ 9703926 A        17-06-1998 
                                          EP 0793718 A        10-09-1997 
                                          EP 0832287 A        01-04-1998 
                                          EP 0931165 A        28-07-1999 
                                          FI 971473 A         04-06-1997 
                                          HU 9900910 A        28-07-1999 
                                          JP 11507528 T       06-07-1999 
                                          JP 10507357 T       21-07-1998 
                                          NO 971644 A         02-06-1997 
                                          NO 975744 A         05-02-1998 
                                          PL 324000 A         27-04-1998 
                                          WO 9641011 A        19-12-1996 

US 5508169 A               16-04-1996     US 5858656 A        12-01-1999 
                                          CA 2036946 A        07-10-1991 

Form PCT/ISA/210 (patent family annex) (July 1992) 



